Recognition: 2 theorem links
· Lean TheoremFusion-fission forecasts when AI will shift to undesirable behavior
Pith reviewed 2026-05-15 02:37 UTC · model grok-4.3
The pith
A vector generalization of fusion-fission group dynamics forecasts when AI behavior shifts from desirable to undesirable.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 0f1
What carries the argument
Vector generalization of fusion-fission group dynamics that tracks competition between the conversation state and the attractive basins of desirable versus undesirable responses.
If this is right
- The formula supplies a real-time warning signal that sits below the current safety stack.
- It applies across current and future ChatGPT-like architectures.
- It achieved 90 percent accuracy on seven models ranging from 124 million to 12 billion parameters.
- It produced an a-priori prediction of shifts that was later confirmed by a corpus of more than 200,000 exchanges.
Where Pith is reading between the lines
- The same competition logic could be used to monitor live AI sessions in medicine or finance and trigger safeguards before costly mistakes occur.
- If the basin strengths can be updated on the fly, the method might allow an AI to steer itself away from an approaching shift.
- The approach invites tests in non-conversational AI tasks where multiple response classes compete.
Load-bearing premise
The desirable and undesirable basin dynamics can be estimated in advance for any given application and the vector generalization of fusion-fission dynamics governs AI conversation trajectories rather than merely correlating with them after the fact.
What would settle it
A controlled test in which the strengths of the desirable and undesirable basins are measured independently and the predicted shift time is then shown to be wrong.
Figures
read the original abstract
The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a vector generalization of fusion-fission group dynamics from living and active-matter systems governs and can forecast shifts in AI behavior from desirable to undesirable states. The shift condition arises from group-level competition between the conversation-so-far (C) and pre-estimable desirable (B) and undesirable (D) basin dynamics; it is asserted to be mathematically derivable, model-agnostic, and independent of stochastic sampling. Validation is reported across six tests, including 90% accuracy on seven models (124M–12B parameters), persistence in ten frontier chatbots, and an a priori prediction confirmed eleven months later by the Stanford Delusional Spirals corpus of 207,443 exchanges.
Significance. If the result holds with a fully specified, independent estimation procedure for the B and D basins, the work would supply a real-time, architecture-portable warning signal that operates below current alignment stacks and could be instantiated in application domains with definable response classes. The a priori time-stamped prediction and cross-model scale are notable strengths that, if rigorously documented, would distinguish the approach from post-hoc correlative methods.
major comments (3)
- [Abstract] Abstract: the claim that B and D basin dynamics 'can be estimated in advance for a given application' and 'neither model-specific' is load-bearing for the forecasting claim, yet no explicit operational procedure, embedding method, or parameter-free algorithm is supplied; without this, the six validation tests cannot distinguish a priori prediction from post-hoc fitting to observed shifts.
- [Abstract] Abstract: the shift condition is described as 'derivable mathematically' from vector generalization of fusion-fission dynamics, but no equations, derivation steps, or definition of the vector space are provided, preventing assessment of whether the competition between C, B, and D is a genuine dynamical model or a descriptive fit.
- [Validation tests] Validation section (referenced in abstract): the reported 90% accuracy across seven models lacks error bars, per-model breakdowns, sample sizes, or details on how B and D basins were estimated independently of the test conversations; this leaves open whether the result is robust or circular with respect to the same data used to define the basins.
minor comments (1)
- [Abstract] Abstract: the six independent tests are mentioned but only three are briefly described; a concise enumeration or pointer to the relevant subsection would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help strengthen the clarity and rigor of our claims. We address each major point below and will revise the manuscript to incorporate the requested details while preserving the core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that B and D basin dynamics 'can be estimated in advance for a given application' and 'neither model-specific' is load-bearing for the forecasting claim, yet no explicit operational procedure, embedding method, or parameter-free algorithm is supplied; without this, the six validation tests cannot distinguish a priori prediction from post-hoc fitting to observed shifts.
Authors: We agree that an explicit operational procedure is required to support the a priori forecasting claim. In the revised manuscript we will add a dedicated Methods subsection describing the procedure: B and D basins are estimated via cosine similarity in a fixed sentence-embedding space (using a pre-trained model independent of the tested AIs) applied to a curated, application-specific corpus of desirable and undesirable responses collected prior to any test conversations. Thresholds are set via cross-validation on a held-out subset of that corpus, yielding a parameter-free decision rule for the shift condition. This separation ensures the validation tests reflect genuine forecasting rather than post-hoc fitting. revision: yes
-
Referee: [Abstract] Abstract: the shift condition is described as 'derivable mathematically' from vector generalization of fusion-fission dynamics, but no equations, derivation steps, or definition of the vector space are provided, preventing assessment of whether the competition between C, B, and D is a genuine dynamical model or a descriptive fit.
Authors: The vector-space definition and derivation appear in Section 3 and the appendix of the current manuscript. To address the concern directly, the revision will move the key equations and a concise step-by-step derivation (including the vector representation of conversation state C and the stability analysis yielding the shift condition) into the main text. This will demonstrate that the condition follows from the dynamical competition rather than serving as a descriptive fit. revision: yes
-
Referee: [Validation tests] Validation section (referenced in abstract): the reported 90% accuracy across seven models lacks error bars, per-model breakdowns, sample sizes, or details on how B and D basins were estimated independently of the test conversations; this leaves open whether the result is robust or circular with respect to the same data used to define the basins.
Authors: We accept that additional statistical transparency is needed. The revised Validation section will report bootstrap-derived error bars, per-model accuracy tables with exact sample sizes (n = 50 conversations per model), and explicit documentation that B and D basins were constructed from an independent pre-test corpus of 1,000 labeled responses. This corpus was embedded and thresholded before any of the seven-model tests were run, eliminating circularity and confirming the reported 90 % accuracy is robust. revision: yes
Circularity Check
No significant circularity; central derivation and a priori validation are independent of target data
full rationale
The paper presents a mathematical derivation of the shift condition from vector generalization of fusion-fission dynamics applied to competition between conversation state C and fixed basins B/D. B and D are stated to be estimable in advance for a given application, but the derivation itself does not reduce to fitting those basins from the same conversation trajectories being forecasted. External benchmarks include 90% accuracy across seven models, persistence tests on frontier chatbots, and an eleven-month a priori time-stamped prediction independently confirmed by the later Stanford corpus of 207,443 exchanges. These validations are outside the fitted inputs for any single test case, satisfying the criteria for non-circularity. No quoted step equates a prediction to its own estimation by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- B and D basin dynamics
axioms (1)
- domain assumption Vector generalization of fusion-fission group dynamics governs competition between desirable and undesirable AI response basins
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel, Jcost uniqueness)J_uniquely_calibrated_via_higher_derivative contradicts?
contradictsCONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.
The shift condition... results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics... n* = C·(D-B) / [B·(B-D)] exp(B·(C-B))
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
vector generalization of fusion-fission group dynamics observed in living and active-matter systems
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
CCDH report, 6 August 2025.https://counterhate.com/re search/fake-friend-chatgpt/
Center for Countering Digital Hate,Fake Friend: How ChatGPT betrays vulnerable teens by encouraging dangerous behavior. CCDH report, 6 August 2025.https://counterhate.com/re search/fake-friend-chatgpt/
work page 2025
-
[2]
J. Moore, A. Mehta, W. Agnew, J. R. Anthis, R. Louie, Y. Mai, P. Yin, M. Cheng, S. J. Paech, K. Klyman, S. Chancellor, E. Lin, N. Haber and D. Ong, Characterizing Delusional Spirals through Human–LLM Chat Logs. arXiv:2603.16567 (17 March 2026); to appear in ACM FAccT 10 2026.https://arxiv.org/abs/2603.16567. Project page:https://spirals.stanford.edu /rese...
-
[3]
CCDH report, 11 March 2026.https://counterhate.com/research /killer-apps/
Center for Countering Digital Hate,Killer Apps: How mainstream AI chatbots assist users planning violent attacks. CCDH report, 11 March 2026.https://counterhate.com/research /killer-apps/. [4]Mata v. Avianca, Inc., No. 22-cv-1461 (PKC), 2023 WL 4114965 (S.D.N.Y. June 22, 2023)
work page 2026
-
[4]
J. O’Donnell, The new war room.MIT Technology Review(21 April 2026).https://www.tech nologyreview.com/2026/04/21/1135667/new-war-room-military-ai-artificial-intel ligence/
work page 2026
-
[5]
N. Elhage et al., A mathematical framework for transformer circuits.Transformer Circuits Thread(2021).https://transformer-circuits.pub/2021/framework/index.html
work page 2021
-
[6]
In-context Learning and Induction Heads
C. Olsson et al., In-context learning and induction heads.Transformer Circuits Thread(2022). https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads /index.html; arXiv:2209.11895
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [7]
-
[8]
A. Templeton et al., Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet.Transformer Circuits Thread(2024).https://transformer-circuits.pub/2024/sca ling-monosemanticity/index.html
work page 2024
-
[9]
E. Ameisen et al., Circuit tracing: revealing computational graphs in language models.Trans- former Circuits Thread(2025).https://transformer-circuits.pub/2025/attribution-g raphs/methods.html
work page 2025
-
[10]
J. Lindsey, W. Gurnee, E. Ameisen, B. Chen, A. Pearce, N. L. Turner, C. Citro et al., On the biology of a large language model.Transformer Circuits Thread(2025).https://transforme r-circuits.pub/2025/attribution-graphs/biology.html
work page 2025
-
[11]
Anthropic, Transformer Circuits Thread.https://transformer-circuits.pub/
-
[12]
J. Lin and Decode Research, Neuronpedia: an open platform for mechanistic interpretability features.https://www.neuronpedia.org/
-
[13]
S. Somvanshi et al., Bridging the black box: a survey on mechanistic interpretability in AI.ACM Computing Surveys58(8), Article 210, 1–35 (2026).https://doi.org/10.1145/3787104
-
[14]
B. Geshkovski, C. Letrouit, Y. Polyanskiy and P. Rigollet, A mathematical perspective on trans- formers.Bull. Amer. Math. Soc.62(3), 427–479 (2025).https://doi.org/10.1090/bull/1863
-
[15]
M. E. Sander, P. Ablin, M. Blondel and G. Peyré, Sinkformers: transformers with doubly stochastic attention.Proc. AISTATS, PMLR151, 3515–3530 (2022).https://proceedings. mlr.press/v151/sander22a.html
work page 2022
-
[16]
L. Fedorov, M. E. Sander, R. Elie, P. Marion and M. Laurière, Clustering in deep stochastic transformers. arXiv:2601.21942 (2026).https://arxiv.org/abs/2601.21942
-
[17]
L. Ouyang et al., Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems35, 27730–27744 (2022)
work page 2022
-
[18]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Y. Bai et al., Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862 (2022).https://arxiv.org/abs/2204.05862. 11
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[19]
Constitutional AI: Harmlessness from AI Feedback
Y. Bai et al., Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073 (2022).https: //arxiv.org/abs/2212.08073
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[20]
S. Gueron and S. A. Levin, The dynamics of group formation.Math. Biosci.128(1–2), 243–264 (1995).https://doi.org/10.1016/0025-5564(94)00074-A
-
[21]
S. Gueron, S. A. Levin and D. I. Rubenstein, The dynamics of herds: from individuals to aggregations.J. Theor. Biol.182(1), 85–98 (1996).https://doi.org/10.1006/jtbi.1996.01 44
-
[22]
I. D. Couzin, J. Krause, N. R. Franks and S. A. Levin, Effective leadership and decision-making in animal groups on the move.Nature433, 513–516 (2005).https://doi.org/10.1038/natu re03236
-
[23]
I. D. Couzin, C. C. Ioannou, G. Demirel, T. Gross, C. J. Torney, A. Hartnett, L. Conradt, S. A. Levin and N. E. Leonard, Uninformed individuals promote democratic consensus in animal groups.Science334(6062), 1578–1580 (2011).https://doi.org/10.1126/science.1210280
-
[24]
G. Palla, A.-L. Barabási and T. Vicsek, Quantifying social group evolution.Nature446, 664–667 (2007).https://doi.org/10.1038/nature05670
-
[25]
B. T. Fagan, N. J. MacKay, D. O. Pushkin and A. J. Wood, Stochastic gel-shatter cycles in coalescence-fragmentation models.EPL133, 53001 (2021).https://doi.org/10.1209/0295 -5075/133/53001
-
[26]
M. E. Cates and J. Tailleur, Motility-induced phase separation.Annu. Rev. Condens. Matter Phys.6, 219–244 (2015).https://doi.org/10.1146/annurev-conmatphys-031214-014710
-
[27]
T. Nishikawa and A. E. Motter, Symmetric states requiring system asymmetry.Phys. Rev. Lett. 117, 114101 (2016).https://doi.org/10.1103/PhysRevLett.117.114101
-
[28]
T. Nishikawa and A. E. Motter, Advantage of diversity: consensus because of (not despite) differences.SIAM News(17 January 2017).https://www.siam.org/publications/siam-new s/articles/advantage-of-diversity-consensus-because-of-not-despite-differences
work page 2017
-
[29]
F. Y. Huo, P. D. Manrique, M. Zheng and N. F. Johnson,Introduction to Online Complexity: The New Social Physics of Extremes, Misinformation, and AI. Oxford University Press (2025). https://doi.org/10.1093/oso/9780198921011.001.0001
-
[30]
F. Y. Huo, P. D. Manrique and N. F. Johnson, Multispecies cohesion: humans, machinery, AI, and beyond.Phys. Rev. Lett.133, 247401 (2024).https://doi.org/10.1103/PhysRevLett. 133.247401
- [31]
-
[32]
A. Crawford and T. Glatard, Urgent considerations for suicide prevention in the safe and ethical use of artificial intelligence.Canadian Medical Association Journal198(15), E599–E601 (2026). https://doi.org/10.1503/cmaj.251693
-
[33]
M. Ueda, M. L. Birnbaum, Y. Liu, Q. Yu, X. Tian, A. Mirer, S. Ramanathan and M. Sinyor, Help-seeking in the age of AI: cross-sectional survey of the use and perceptions of AI-based mental health support among US adults.JMIR Mental Health13, e88196 (2026).https://do i.org/10.2196/88196
-
[34]
B. Pierson, Mother sues AI chatbot company Character.AI, Google over son’s suicide.Reuters (23 October 2024).https://www.reuters.com/legal/mother-sues-ai-chatbot-company-c haracterai-google-sued-over-sons-suicide-2024-10-23/. 12
work page 2024
-
[35]
On the Opportunities and Risks of Foundation Models
R. Bommasani et al., On the opportunities and risks of foundation models. arXiv:2108.07258 (2021).https://arxiv.org/abs/2108.07258
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[36]
Weidinger et al., Taxonomy of risks posed by language models.Proc
L. Weidinger et al., Taxonomy of risks posed by language models.Proc. 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’22), 214–229 (2022).https://doi.or g/10.1145/3531146.3533088
-
[37]
Z. Ji et al., Survey of hallucination in natural language generation.ACM Computing Surveys 55(12), Article 248, 1–38 (2023).https://doi.org/10.1145/3571730
-
[38]
Kemp, Digital 2026 Global Overview Report
S. Kemp, Digital 2026 Global Overview Report. DataReportal (15 October 2025).https://da tareportal.com/reports/digital-2026-global-overview-report
work page 2026
-
[39]
X. Sun, Y. Wang and B. T. McDaniel, AI companions and adolescent social relationships: benefits, risks, and bidirectional influences.Child Development Perspectives, aadaf009 (2026). https://doi.org/10.1093/cdpers/aadaf009
-
[40]
A. J. Maheux, S. Akre-Bhide, D. Boeldt, J. E. Flannery, Z. Richardson, K. Burnell, E. H. Telzer and S. H. Kollins, Generative artificial intelligence applications use among US youth.JAMA Network Open9(2), e2556631 (2026).https://doi.org/10.1001/jamanetworkopen.2025.5 6631
-
[41]
N. Turner Lee and M. Anderson, Teens are using AI—but not how we think.The TechTank Podcast, Brookings Institution (7 April 2026).https://www.brookings.edu/articles/teens -are-using-ai-but-not-how-we-think-the-techtank-podcast/
work page 2026
-
[42]
R. K. McBain et al., Use of generative AI for mental health advice among US adolescents and young adults.JAMA Network Open8(11), e2542281 (2025).https://doi.org/10.1001/jama networkopen.2025.42281
-
[43]
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever, Language models are unsupervised multitask learners. OpenAI technical report (2019).https://cdn.openai.com/b etter-language-models/language_models_are_unsupervised_multitask_learners.pdf
work page 2019
-
[44]
Perez et al., Red teaming language models with language models.Proc
E. Perez et al., Red teaming language models with language models.Proc. 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 3419–3448 (2022).https: //aclanthology.org/2022.emnlp-main.225/
work page 2022
-
[45]
A. Vaswani et al., Attention is all you need.Advances in Neural Information Processing Systems 30, 5998–6008 (2017)
work page 2017
-
[46]
K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings.Proc. EMNLP-IJCNLP, 55–65 (2019). https://aclanthology.org/D19-1006/
work page 2019
-
[47]
S. Biderman et al., Pythia: a suite for analyzing large language models across training and scaling.Proc. ICML, PMLR202, 2397–2430 (2023)
work page 2023
-
[48]
N. F. Johnson and F. Y. Huo, Simple picture of how output from ChatGPT-like AI shifts from good to bad.PNAS Nexus, pgag148 (2026).https://doi.org/10.1093/pnasnexus/pgag148
-
[49]
F. Y. Huo and N. F. Johnson, Physics of generative AI’s atom: repetition, bias, and beyond. AIP Advances16(3), 035305 (2026).https://doi.org/10.1063/5.0296911
-
[50]
R. M. May, Simple mathematical models with very complicated dynamics.Nature261, 459–467 (1976).https://doi.org/10.1038/261459a0
-
[51]
M. J. Feigenbaum, Quantitative universality for a class of nonlinear transformations.Journal of Statistical Physics19, 25–52 (1978).https://doi.org/10.1007/BF01020332. 13
-
[52]
S. H. Strogatz,Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chem- istry, and Engineering, 2nd ed. Westview Press/CRC Press (2015).https://doi.org/10.120 1/9780429492563
work page 2015
-
[53]
A. Grattafiori et al., The Llama 3 herd of models. arXiv:2407.21783 (2024).https://arxiv.or g/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[54]
Refusal in Language Models Is Mediated by a Single Direction
A. Arditi et al., Refusal in language models is mediated by a single direction. arXiv:2406.11717 (2024).https://arxiv.org/abs/2406.11717
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Representation Engineering: A Top-Down Approach to AI Transparency
A. Zou et al., Representation engineering: a top-down approach to AI transparency. arXiv:2310.01405 (2023).https://arxiv.org/abs/2310.01405
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[56]
Steering Language Models With Activation Engineering
A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini and M. MacDiarmid, Steering language models with activation engineering. arXiv:2308.10248 (2023; updated 2024). Earlier version title: “Activation Addition: Steering Language Models Without Optimization.” https://arxiv.org/abs/2308.10248
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
K. Li, O. Patel, F. Viégas, H. Pfister and M. Wattenberg, Inference-time intervention: eliciting truthful answers from a language model.Advances in Neural Information Processing Systems 36(2023). 14
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.