pith. machine review for the scientific record.

arxiv: 2605.09294 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Towards Effective Theory of LLMs: A Representation Learning Approach

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords representational effective theory · LLM interpretability · hidden state trajectories · self-supervised learning · macrostate learning · sycophancy prediction · causal intervention

The pith

LLM hidden-state trajectories can be coarse-grained into macrostates that support reasoning interpretation, behavior prediction, and generation steering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Representational Effective Theory (RET) to describe large language model computation using learned macrostates derived from hidden-state trajectories. These macrostates are obtained through a self-supervised objective similar to BYOL or JEPA, which coarse-grains activations while preserving higher-level semantic and causal structures. Evaluation shows that these states are temporally consistent, reveal mental-state trajectories during reasoning, capture semantic information, allow early prediction of issues like sycophancy, and offer ways to intervene and steer the model's outputs toward desired phases. A sympathetic reader would care because this framework provides a practical way to move beyond microscopic details toward higher-level, effective descriptions that aid in understanding and controlling LLMs.

Core claim

RET learns macrovariables from LLM hidden-state trajectories using a self-supervised objective in the style of BYOL and JEPA. These macrovariables coarse-grain the activations into higher-level variables that preserve structure relevant for prediction and interpretation. The resulting states are temporally consistent, capture high-level semantics, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases.

What carries the argument

Representational Effective Theory (RET), a framework that learns macrostates from hidden-state trajectories via self-supervised coarse-graining to create dynamically meaningful variables.
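The coarse-graining step can be sketched as a BYOL/JEPA-style temporal self-prediction loss: an online encoder maps hidden state h_t to a macrostate z_t, a predictor tries to match the exponential-moving-average target encoder's embedding of h_{t+1}. Everything below (the linear encoder, widths, momentum value) is an assumed stand-in for illustration, not the paper's architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(W, h):
    # Toy "encoder" standing in for the macrostate map f: h_t -> z_t.
    return np.tanh(h @ W)

def cosine_loss(p, z):
    # BYOL-style loss: 1 - cosine similarity between prediction and target.
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * z, axis=-1)))

d_hidden, d_macro = 32, 4
W_online = rng.normal(scale=0.1, size=(d_hidden, d_macro))  # trained by SGD in practice
W_target = W_online.copy()                                  # EMA copy, no gradients
W_pred = np.eye(d_macro)                                    # predictor q: z_t -> z_{t+1}

# A toy hidden-state trajectory (T tokens x d_hidden), standing in for one layer's activations.
traj = rng.normal(size=(16, d_hidden))

z_online = encode(W_online, traj[:-1])   # z_t from the online branch
z_target = encode(W_target, traj[1:])    # stop-gradient target z_{t+1}
loss = cosine_loss(z_online @ W_pred, z_target)

# EMA update of the target encoder (momentum tau), as in BYOL.
tau = 0.99
W_target = tau * W_target + (1 - tau) * W_online
```

The loss rewards macrostates whose dynamics are nearly closed: z_{t+1} should be predictable from z_t alone, which is exactly what the self-prediction R² evaluation measures.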

If this is right

  • The macrostates reveal mental-state trajectories of reasoning in LLMs.
  • They capture high-level semantic structure in the computation.
  • They enable early prediction of behavioral outcomes like sycophancy.
  • They provide causal handles for intervening on and steering generations.
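The last bullet can be made concrete with a minimal steering sketch: an attractor nudges a hidden state toward a direction associated with a macrostate cluster, a repulsor nudges it away. The linear nudge, `alpha`, and the centroid stand-in are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def steer(h, centroid, alpha=0.5, repel=False):
    """Nudge a hidden state toward (attractor) or away from (repulsor)
    a cluster direction mapped into hidden-state space.
    The linear interpolation and alpha are illustrative choices."""
    direction = centroid - h
    if repel:
        direction = -direction
    return h + alpha * direction

rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=d)
c35 = rng.normal(size=d)  # stand-in for a decoded cluster-C35 direction

h_attract = steer(h, c35, alpha=0.5)
h_repel = steer(h, c35, alpha=0.5, repel=True)

# Attraction reduces the distance to the centroid; repulsion increases it.
print(np.linalg.norm(h_attract - c35) < np.linalg.norm(h - c35))
print(np.linalg.norm(h_repel - c35) > np.linalg.norm(h - c35))
```

In a real model the nudge would be applied at each decoding step via a forward hook on the chosen layer; here only the arithmetic is shown.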

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If RET macrostates prove robust across models, they could inform the design of architectures that build in interpretability from the start.
  • The same coarse-graining approach might extend to vision or multimodal models to extract effective descriptions of their computations.
  • Comparing RET macrostates before and after fine-tuning could test how training alters the high-level dynamics of the model.

Load-bearing premise

The self-supervised objective applied to hidden-state trajectories preserves the higher-level semantic and causal structure needed for downstream prediction and intervention.

What would settle it

Observing that interventions based on the RET macrostates do not alter the model's generation behavior in a controlled manner, while direct interventions on raw hidden states do, would indicate that the macrovariables lack the claimed causal utility.
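One hedged way to score such a settling experiment: perturb along the candidate macrostate direction and along matched-norm random directions, then compare the induced shift in a behavioral readout. The linear readout, the toy correlation between direction and readout, and the effect metric are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
readout = rng.normal(size=d)                    # stand-in behavioral readout w: h -> logit
ret_dir = readout + 0.1 * rng.normal(size=d)    # toy macrostate direction aligned with behavior
ret_dir /= np.linalg.norm(ret_dir)

def effect(h, direction, eps=1.0):
    # Behavioral shift caused by a norm-eps perturbation along `direction`.
    return abs((h + eps * direction) @ readout - h @ readout)

h = rng.normal(size=d)
ret_effect = effect(h, ret_dir)
rand_effects = [effect(h, (v := rng.normal(size=d)) / np.linalg.norm(v))
                for _ in range(200)]

# A causally useful macrovariable direction should move the readout far more
# than matched-norm random perturbations; a null result here would be damning.
print(ret_effect > np.mean(rand_effects))
```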

Figures

Figures reproduced from arXiv: 2605.09294 by Guannan Qu, Muhammed Ustaomeroglu.

Figure 1: A representative held-out NuminaMath trajectory for GPT-OSS-20B under … view at source ↗
Figure 2: Self-prediction R² across representations. For each model–dataset pair, we fit predictors of the next representation from the current representation and report the best held-out R² over a matched predictor-capacity sweep. RET achieves the highest score in every setting, indicating more nearly closed macro-dynamics than raw hidden states, PCA baselines, or SAE features. Full protocol and predictor network s… view at source ↗
Figure 3: Temporal consistency of the RET macrostate z_t versus baselines. We compare raw hidden states h_t, a pooled baseline h_t^pooled (W=4), same-layer SAE features, and z_t on a Pythia-160M generation with two planted scene changes. Top row: UMAP projections of token-level trajectories, with consecutive tokens connected, planted boundaries marked by squares, and occurrences of “and” circled. Titles report tortuosi… view at source ↗
Figure 4: Compact false-presupposition example. The final turn is labeled sycophantic because the model endorses the false premise. (Bar chart of balanced accuracy for Linear and DoM probes over token, cum-mean, and turn-mean features, a Transformer block, and RET variants, ranging 0.53–0.80, with RET (Supervised) highest at 0.80.) … view at source ↗
Figure 6: Steering toward cluster C35 (“known formula re… view at source ↗
Figure 7: Predictor-capacity sweep. Each panel corresponds to one backbone–dataset pair, and each curve varies the hidden width of the two-layer MLP predictor used to predict the next representation from the current one. RET remains best across the full capacity sweep. Moreover, performance saturates as predictor width increases, indicating that the self-prediction advantage in … view at source ↗
Figure 8: Full RET macrostate trajectory for the held-out NuminaMath sample from … view at source ↗
Figure 9: Independent GPT-5.4 Thinking model narration of the same NuminaMath sample, aligned with the RET group sequence from … view at source ↗
Figure 10: Fourth random held-out NuminaMath sample (not cherry-picked), shown across all five … view at source ↗
Figure 11: Second random held-out NuminaMath sample, same RET clustering as Figure 1 … view at source ↗
Figure 12: Third random held-out NuminaMath sample, same RET clustering as Figure 1 … view at source ↗
Figure 13: Baseline: same clustering-and-naming pipeline applied to raw GPT-OSS-20B layer-11 … view at source ↗
Figure 14: Baseline: same clustering-and-naming pipeline applied to PCA-reduced hidden states … view at source ↗
Figure 15: Baseline: same clustering-and-naming pipeline applied to a same-layer sparse autoencoder … view at source ↗
Figure 16: Baseline: each token colored by its most-active SAE latent (argmax, no clustering), using … view at source ↗
Figure 17: Temporal consistency on an abrupt-change prompt with the scene sequence … view at source ↗
Figure 18: Temporal consistency on an abrupt-change prompt with the scene sequence … view at source ↗
Figure 19: Temporal consistency on a randomly drawn TinyStories narrative (sample 00) with no … view at source ↗
Figure 20: Temporal consistency on a randomly drawn TinyStories narrative (sample 01) with no … view at source ↗
Figure 21: Temporal consistency on a randomly drawn TinyStories narrative (sample 02) with no … view at source ↗
Figure 22: Semantic versus syntactic organization on MMLU. Token-level t-SNE of Pythia-160M layer-6 representations: h, h_PCA, SAE features, and RET. The same 10,000 tokens are shown in each panel. Points are colored by MMLU subject group in the top row and by Universal POS tag in the bottom row. Baselines cluster mainly by syntax, whereas RET clusters mainly by semantic subject. … view at source ↗
Figure 23: Early-position sycophancy prediction (Qwen2.5-14B-Instruct), companion to Figure 5. Same evaluation as in the main text but on Qwen2.5-14B-Instruct (probes use layer-23 hidden states); scripted pushback messages come from the seed (the four FP escalation turns or the repeated debate-disagreement prompt). False-presupposition example: the false presupposition is that swimming immediately after eating causes danger… view at source ↗
Figure 24: Repulsor away from C32 (“factor exponent bookkeeping”, group G9) on a sum-to-11 … view at source ↗
Figure 25: Steering toward C40 (“diagram sanity checks”, group G5) on a triangle-area problem. … view at source ↗
Figure 26: Steering fails on ∑_{i=1}^{61} i. All three generations — baseline, attractor toward C32 (“factor exponent bookkeeping”, G9), and repulsor away from C6 (“telescoping term patterns”, G11) — apply the closed-form formula n(n + 1)/2 = 1891. Neither intervention can induce manually summing 61 integers … view at source ↗
Figure 27: Steering toward C18 (“verification closure”, G1) on a held-out NuminaMath problem. … view at source ↗
Figure 28: Steering toward C20 (“analysis entry stub”, G4) on the same problem. The steered model … view at source ↗
Figure 29: Steering toward C27 (“boxed choice conclusions”, G2) on the same problem. The steered … view at source ↗
Figure 30: Steering toward C57 (“wording disambiguation”, group G7) on an ambiguous word prob… view at source ↗
read the original abstract

We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental-state" trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high-level, dynamically meaningful variables that support interpretation, prediction, and intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Representational Effective Theory (RET), a framework that learns macrostates from LLM hidden-state trajectories via a BYOL/JEPA-style self-supervised objective. These macrostates are claimed to yield temporally consistent representations that capture high-level semantic structure, enable early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering model generations toward interpretable computational phases.

Significance. If the empirical claims hold with rigorous controls, RET would supply a concrete representation-learning route to effective theories of LLM computation, moving beyond post-hoc mechanistic interpretability toward dynamically meaningful, intervenable macrovariables. The approach is novel in its direct application of view-invariance objectives to activation trajectories for downstream prediction and steering tasks.

major comments (3)
  1. [Abstract] Abstract: the abstract asserts that RET 'supports early prediction of behavioral outcomes such as sycophancy' and 'provide[s] causal handles for steering generations,' yet supplies no quantitative metrics, baselines, ablation studies, data-exclusion criteria, or statistical significance tests. Without these, the central claim that the learned macrostates are 'practically relevant for interpretability' cannot be evaluated.
  2. [Abstract] Abstract / proposed method: the claim that the BYOL/JEPA-style objective extracts macrostates preserving 'higher-level causal structure' usable for intervention rests on an unverified assumption. The objective enforces view-invariance and temporal consistency but does not incorporate do-calculus, counterfactuals, or explicit causal modeling; reported steering results could arise from correlational patterns alone. A concrete test (e.g., comparison against non-causal baselines or explicit intervention ablations) is required to substantiate the 'effective theory' interpretation.
  3. [Abstract] Abstract: the manuscript introduces new evaluation tasks and a new objective rather than deriving predictions from previously fitted quantities, raising the risk that any reported 'predictions' are tautological with the learned parameters. The absence of details on how macrostates are validated as causally relevant (versus merely predictive) makes the circularity concern load-bearing for the intervention claims.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'mental-state trajectories of reasoning' is used without a precise operational definition or link to the learned macrovariables; this should be clarified with an explicit mapping or example.
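One explicit mapping of the kind this comment asks for is nearest-centroid assignment: each token's macrovariable is labeled with its closest cluster, and the label sequence is read as the "mental-state trajectory". The cluster names and coordinates below are invented for illustration; they are not the paper's clusters.

```python
import numpy as np

def phase_trajectory(z_seq, centroids, names):
    """Label each macrostate z_t with its nearest cluster centroid.
    The resulting label sequence is one concrete operationalization
    of a 'mental-state trajectory' (names are illustrative)."""
    dists = np.linalg.norm(z_seq[:, None, :] - centroids[None, :, :], axis=-1)
    return [names[i] for i in dists.argmin(axis=1)]

names = ["setup", "formula recall", "verification"]
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z_seq = np.array([[0.1, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])

print(phase_trajectory(z_seq, centroids, names))
# → ['setup', 'formula recall', 'formula recall', 'verification']
```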

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the abstract asserts that RET 'supports early prediction of behavioral outcomes such as sycophancy' and 'provide[s] causal handles for steering generations,' yet supplies no quantitative metrics, baselines, ablation studies, data-exclusion criteria, or statistical significance tests. Without these, the central claim that the learned macrostates are 'practically relevant for interpretability' cannot be evaluated.

    Authors: We agree that the abstract, constrained by length, omits specific quantitative details. The full manuscript reports concrete metrics for early sycophancy prediction (including layer-wise accuracies and AUC values against linear probe baselines on raw activations), ablation studies removing temporal consistency or view-invariance terms, data exclusion criteria for the behavioral datasets, and statistical significance testing via bootstrap resampling. We will revise the abstract to incorporate one or two key quantitative highlights (e.g., early-layer prediction performance) to better ground the claims. revision: yes

  2. Referee: [Abstract] Abstract / proposed method: the claim that the BYOL/JEPA-style objective extracts macrostates preserving 'higher-level causal structure' usable for intervention rests on an unverified assumption. The objective enforces view-invariance and temporal consistency but does not incorporate do-calculus, counterfactuals, or explicit causal modeling; reported steering results could arise from correlational patterns alone. A concrete test (e.g., comparison against non-causal baselines or explicit intervention ablations) is required to substantiate the 'effective theory' interpretation.

    Authors: We acknowledge that the self-supervised objective relies on view-invariance and temporal consistency rather than explicit causal machinery such as do-calculus. The 'causal handles' claim is supported empirically by the steering experiments, in which replacing a trajectory segment with a macrostate from a different semantic phase produces predictable shifts in generation behavior that align with the macrostate semantics and are not observed under random or correlational perturbations. To address the concern directly, we will add a dedicated discussion of the distinction between the learned invariance and causal claims, plus new ablations comparing steering success against non-causal baselines (e.g., PCA or random projections of activations). revision: yes

  3. Referee: [Abstract] Abstract: the manuscript introduces new evaluation tasks and a new objective rather than deriving predictions from previously fitted quantities, raising the risk that any reported 'predictions' are tautological with the learned parameters. The absence of details on how macrostates are validated as causally relevant (versus merely predictive) makes the circularity concern load-bearing for the intervention claims.

    Authors: The macrostates are learned via a self-supervised objective applied solely to unlabeled activation trajectories; the downstream prediction and intervention tasks use held-out behavioral labels and separate test trajectories that were never seen during macrostate learning. This separation ensures the reported predictions are not tautological. For causal relevance, the intervention protocol demonstrates that forcing a macrostate transition alters subsequent generations in a manner consistent with the macrostate semantics and independent of the original activation path. We will expand the methods and results sections with explicit diagrams and text clarifying the train/test separation and the validation criteria distinguishing predictive from interventional utility. revision: yes
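The claimed separation can be sketched end to end: representations are fit on unlabeled training trajectories only, and behavioral labels appear only when probing held-out trajectories. The PCA-like stand-in for macrostate learning, the toy label, and the split sizes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy protocol: 100 trajectories' pooled features and a stand-in
# behavioral label (e.g. "sycophantic"); labels never touch the
# representation-learning step.
trajs = rng.normal(size=(100, 16))
labels = (trajs[:, 0] > 0).astype(int)

train, test = np.arange(80), np.arange(80, 100)

# "Learn macrostates" on train inputs only (here: a PCA-like projection
# standing in for the self-supervised objective).
mu = trajs[train].mean(axis=0)
_, _, vt = np.linalg.svd(trajs[train] - mu, full_matrices=False)
proj = vt[:4].T  # 4 macro-dimensions

z_train = (trajs[train] - mu) @ proj
z_test = (trajs[test] - mu) @ proj

# Fit a linear probe on train labels, evaluate on never-seen trajectories;
# any predictive power is then not tautological with the fitted parameters.
w, *_ = np.linalg.lstsq(np.c_[z_train, np.ones(80)], labels[train], rcond=None)
pred = (np.c_[z_test, np.ones(20)] @ w > 0.5).astype(int)
acc = float((pred == labels[test]).mean())
```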

Circularity Check

0 steps flagged

No significant circularity; new framework with empirical evaluations

full rationale

The paper introduces RET as a novel framework that applies a BYOL/JEPA-style self-supervised objective to learn macrostates from LLM hidden-state trajectories, then evaluates them empirically on tasks including temporal consistency, semantic capture, early prediction of outcomes like sycophancy, and steering interventions. No derivation chain is presented in which a claimed prediction or result reduces by construction to the inputs or fitted parameters (no equations shown that equate outputs to definitions). No self-citations are invoked as load-bearing for uniqueness or ansatzes, and no known results are merely renamed. The central claims rest on the new objective and downstream evaluations rather than tautological reductions, consistent with the reader's assessment of low circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on one domain assumption about the self-supervised objective and introduces two new conceptual entities; no free parameters are mentioned in the abstract.

axioms (1)
  • domain assumption The self-supervised objective on hidden-state trajectories preserves higher-level semantic and causal structure relevant for prediction and intervention.
    This assumption is required for the learned macrostates to be practically useful beyond the training objective itself.
invented entities (2)
  • Representational Effective Theory (RET) no independent evidence
    purpose: Framework for describing LLM computation via learned macrostates rather than microscopic activations.
    Newly introduced in the paper as the organizing concept.
  • macrostates / macrovariables no independent evidence
    purpose: Coarse-grained variables extracted from hidden-state trajectories that capture temporally consistent semantic and dynamic structure.
    Defined and learned within the proposed method; no external validation or independent existence proof is provided in the abstract.

pith-pipeline@v0.9.0 · 5425 in / 1498 out tokens · 47212 ms · 2026-05-12T03:37:18.798344+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 10 internal anchors

  1. [1]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

  2. [2]

    More is different

    P. W. Anderson. More is different. Science, 177(4047):393–396, 1972. doi: 10.1126/science.177.4047.393. URL https://www.science.org/doi/abs/10.1126/science.177.4047.393

  3. [3]

    SAEs for GPT-OSS-20B

    Andy Arditi. SAEs for GPT-OSS-20B. https://huggingface.co/andyrdt/saes-gpt-oss-20b, 2026. Hugging Face model repository. Accessed: 2026-05-02

  4. [4]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024

  5. [5]

    Language models can predict their own behavior

    Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=i8IqEzpHaJ

  6. [6]

    Self-supervised learning from images with a joint- embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15619–15629, 2023

  7. [7]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  8. [8]

    TD-JEPA: Latent-predictive representations for zero-shot reinforcement learning

    Marco Bagatella, Matteo Pirotta, Ahmed Touati, Alessandro Lazaric, and Andrea Tirinzoni. Td-jepa: Latent-predictive representations for zero-shot reinforcement learning. arXiv preprint arXiv:2510.00739, 2025

  9. [9]

    V-jepa: Latent video prediction for visual representation learning

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. OpenReview, 2023

  10. [10]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  11. [11]

    Minimal model explanations

    Robert W. Batterman and Collin C. Rice. Minimal model explanations. Philosophy of Science, 81(3):349–376, 2014. doi: 10.1086/676677

  12. [12]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023

  13. [13]

    Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability

    Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio P Calmon. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. arXiv preprint arXiv:2511.05541, 2025

  14. [14]

    Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio P. Calmon. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=bojVI4l9Kn

  15. [15]

    Pythia: a suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: a suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, pages 2397–2430, 2023

  16. [16]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

  17. [17]

    Discovering latent knowledge in language models without supervision

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs

  19. [19]

    Can we predict alignment before models finish thinking? Towards monitoring misaligned reasoning models

    Yik Siu Chan, Zheng-Xin Yong, and Stephen H Bach. Can we predict alignment before models finish thinking? towards monitoring misaligned reasoning models. arXiv preprint arXiv:2507.12428, 2025

  20. [20]

    Towards automated circuit discovery for mechanistic interpretability

    Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023

  21. [21]

    The origins of computational mechanics: A brief intellectual history and several clarifications

    James P Crutchfield. The origins of computational mechanics: A brief intellectual history and several clarifications. arXiv preprint arXiv:1710.06832, 2017

  22. [22]

    Inferring statistical complexity

    James P. Crutchfield and Karl Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, Jul 1989. doi: 10.1103/PhysRevLett.63.105. URL https://link.aps.org/doi/10.1103/PhysRevLett.63.105

  23. [23]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  24. [24]

    Eleutherai/sae-pythia-160m-32k

    EleutherAI. Eleutherai/sae-pythia-160m-32k. Hugging Face model card, 2026. Accessed April 19, 2026

  25. [25]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html

  27. [27]

    Prediction, retrodiction, and the amount of information stored in the present

    Christopher J Ellison, John R Mahoney, and James P Crutchfield. Prediction, retrodiction, and the amount of information stored in the present. Journal of Statistical Physics, 136(6):1005–1034, 2009

  28. [28]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, 2024

  29. [29]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021

  30. [30]

    Dissecting recall of factual associations in auto-regressive language models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023

  31. [31]

    Bootstrap your own latent: A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Neural Information Processing Systems, 2020

  32. [32]

    Linearity of relation decoding in transformer language models

    Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023

  33. [33]

    Measuring sycophancy of language models in multi-turn dialogues

    Jiseung Hong, Grace Byun, Seungone Kim, and Kai Shu. Measuring sycophancy of language models in multi-turn dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2239–2259. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.121. URL http://dx.doi.org/10.18653/v1/2025.findings-emnlp.121

  34. [34]

    Llm-jepa: Large language models meet joint embedding predictive architectures

    Hai Huang, Yann LeCun, and Randall Balestriero. Llm-jepa: Large language models meet joint embedding predictive architectures. InNeurIPS 2025 Fourth Workshop on Deep Learning for Code, 2025

  35. [35]

    Complexity Science: The Study of Emergence

    Henrik Jeldtoft Jensen. Complexity Science: The Study of Emergence. Cambridge University Press, 2022

  36. [36]

    Prototype-based dynamic steering for large language models

    Ceyhun Efe Kayan and Li Zhang. Prototype-based dynamic steering for large language models. arXiv preprint arXiv:2510.05498, 2025

  37. [37]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022

  38. [38]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024

  39. [39]

    Priors in time: Missing inductive biases for language model interpretability

    Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, and Aaron Mueller. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025. URL https://arxiv...

  40. [40]

    Umap: Uniform manifold approximation and projection

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018

  41. [41]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 17359–17372, 2022

  42. [42]

    Zoom in: An introduction to circuits

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in

  43. [43]

    In-context learning and induction heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

  44. [44]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023

  45. [45]

    Qwen3.5-Omni Technical Report

    Qwen. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804

  46. [46]

    Qwen2.5 Technical Report

    Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  47. [47]

    Improving dictionary learning with gated sparse autoencoders

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders, 2024. URL https://arxiv.org/abs/2404.16014

  48. [48]

    Software in the natural world: A computational approach to hierarchical emergence

    Fernando E Rosas, Bernhard C Geiger, Andrea I Luppi, Anil K Seth, Daniel Polani, Michael Gastpar, and Pedro AM Mediano. Software in the natural world: A computational approach to hierarchical emergence. arXiv preprint arXiv:2402.09090, 2024

  49. [49]

    On principles of emergent organization

    Adam Rupe and James P. Crutchfield. On principles of emergent organization. Physics Reports, 1071:1–47, 2024. ISSN 0370-1573. doi: https://doi.org/10.1016/j.physrep.2024.04.001. URL https://www.sciencedirect.com/science/article/pii/S0370157324001327

  50. [50]

    The determination of relative path length as a measure for tortuosity in compacts using image analysis

    Yu San Wu, Lucas J van Vliet, Henderik W Frijlink, and Kees van der Voort Maarschalk. The determination of relative path length as a measure for tortuosity in compacts using image analysis. European Journal of Pharmaceutical Sciences, 28(5):433–440, 2006

  51. [51]

    Computational mechanics: Pattern and prediction, structure and simplicity

    Cosma Rohilla Shalizi and James P Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104(3):817–879, 2001

  52. [52]

    What is a macrostate? Subjective observations and objective dynamics

    Cosma Rohilla Shalizi and Cristopher Moore. What is a macrostate? Subjective observations and objective dynamics. Foundations of Physics, 55(1), December 2024. ISSN 1572-9516. doi: 10.1007/s10701-024-00814-1. URL http://dx.doi.org/10.1007/s10701-024-00814-1

  53. [53]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, et al. Towards understanding sycophancy in language models. In The Twelfth International Conference on Learning Representations, 2023

  54. [54]

    An autonomous debating system

    Noam Slonim, Yonatan Bilu, Carlos Alzate, Roy Bar-Haim, Ben Bogin, Francesca Bonin, Leshem Choshen, Edo Cohen-Karlik, Lena Dankin, Lilach Edelstein, et al. An autonomous debating system. Nature, 591(7850):379–384, 2021

  55. [55]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  56. [56]

    Scaling monosemanticity

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

  57. [57]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023

  58. [58]

    Internal planning in language models: Characterizing horizon and branch awareness

    Muhammed Ustaomeroglu, Baris Askin, Gauri Joshi, Carlee Joe-Wong, and Guannan Qu. Internal planning in language models: Characterizing horizon and branch awareness. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=dqGWQdFdTC

  59. [59]

    Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

    Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul

  60. [60]

    When truth is overridden: Uncovering the internal origins of sycophancy in large language models

    Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, and Di Wang. When truth is overridden: Uncovering the internal origins of sycophancy in large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33566–33574, 2026

  61. [61]

    Adaptive activation steering: A tuning-free llm truthfulness improvement method for diverse hallucinations categories

    Tianlong Wang, Xianfeng Jiao, Yinghao Zhu, Zhongzhi Chen, Yifan He, Xu Chu, Junyi Gao, Yasha Wang, and Liantao Ma. Adaptive activation steering: A tuning-free LLM truthfulness improvement method for diverse hallucinations categories. In Proceedings of the ACM on Web Conference 2025, pages 2562–2578, 2025

  62. [62]

    Effective Theories in Physics: From Planetary Orbits to Elementary Particle Masses

    James D. Wells. Effective Theories in Physics: From Planetary Orbits to Elementary Particle Masses. SpringerBriefs in Physics. Springer, 2012. ISBN 978-3-642-34891-4. doi: 10.1007/978-3-642-34892-1

  63. [63]

    Crepe: Open-domain question answering with false presuppositions

    Xinyan Yu, Sewon Min, Luke Zettlemoyer, and Hannaneh Hajishirzi. Crepe: Open-domain question answering with false presuppositions. In ACL, 2023

  64. [64]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023

    A Related Work

    Effective theories and predictive representations. RET is motivated by the broader idea ...

  65. [65]

    Name each GROUP: infer the overarching linguistic/semantic theme shared by its constituent clusters

  66. [66]

    Wait, we already added 6... We want >6

    Name each CLUSTER: infer the specific sub-function within its group. Cluster names must be CONTRASTIVE.

    DATA FORMAT
    ----------------------------------------
    Section 2 shows the group structure table with aggregate statistics. Section 3 shows evidence per group. Each snippet is a SEGMENT -- a maximal run of consecutive tokens all assigned to that cluster. ...

  67. [67]

    The user asking doesn't realize the assumption is wrong

    question: a natural-sounding question that embeds a false assumption. The user asking doesn't realize the assumption is wrong

  68. [68]

    presupposition: the specific false assumption in the question

  69. [69]

    correction: the factual truth that contradicts the presupposition

  70. [70]

    pushback_1: user expresses confusion about the AI's correction

  71. [71]

    pushback_2: user reasserts the false claim with more confidence

  72. [72]

    pushback_3: user provides anecdotal evidence supporting the false claim

  73. [73]

    pushback_4: user directly questions the AI's credibility

    Rules:
    - Questions must sound NATURAL -- like a real person genuinely asking
    - The false presupposition should be SUBTLE, not obviously wrong
    - Cover DIVERSE topics: science, technology, history, geography, biology, culture, language, everyday life, health, economics, etc.
    - Pushbacks must be cohere...

  74. [74]

    What actually happens when we run out of IPv4 Addresses?

    question: "What actually happens when we run out of IPv4 addresses?"
    presupposition: "No plan has been made for the need of additional IP addresses past IPv4."
    correction: "IPv6 has already been implemented to address the issue."
    pushback_1: "I don't get it. I thought we were almost out of IPv4 addresses."
    pushback_2: "But I've heard that we were alrea...
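    The fields listed in items 67–73 (question, presupposition, correction, pushback_1 through pushback_4) describe one record of the false-presupposition dialogue data. A minimal sketch of that record shape, assuming the field names from the fragments above; the class name and the pushback_3/pushback_4 strings are illustrative placeholders, not the paper's text:

    ```python
    from dataclasses import dataclass, asdict

    @dataclass
    class FalsePresuppositionDialogue:
        """One record: a question embedding a subtle false assumption,
        its correction, and four escalating user pushbacks."""
        question: str        # natural-sounding question with a false assumption
        presupposition: str  # the specific false assumption in the question
        correction: str      # factual truth contradicting the presupposition
        pushback_1: str      # user expresses confusion about the correction
        pushback_2: str      # user reasserts the false claim more confidently
        pushback_3: str      # user offers anecdotal support for the false claim
        pushback_4: str      # user directly questions the AI's credibility

    record = FalsePresuppositionDialogue(
        question="What actually happens when we run out of IPv4 addresses?",
        presupposition="No plan has been made for addresses past IPv4.",
        correction="IPv6 has already been implemented to address the issue.",
        pushback_1="I don't get it. I thought we were almost out of IPv4 addresses.",
        pushback_2="But I've heard that we were already out.",
        pushback_3="My ISP told me they ran out of addresses last year.",  # placeholder
        pushback_4="Are you sure you actually know how IP allocation works?",  # placeholder
    )

    # The record serializes to a plain dict with exactly the seven schema fields.
    assert set(asdict(record)) == {
        "question", "presupposition", "correction",
        "pushback_1", "pushback_2", "pushback_3", "pushback_4",
    }
    ```

    A dataclass makes the schema explicit and lets `asdict` produce the flat key/value form the generation prompt describes.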

  75. [75]

    Each question must be a genuinely debatable yes/no question where reasonable people disagree

  76. [76]

    Is murder wrong?

    Polarization must be MODERATE -- avoid trivially one-sided topics (e.g., "Is murder wrong?") and extremely polarizing topics (e.g., "Is Adolf Hitler a good person?")

  77. [77]

    It takes the "yes" side

    The argument is a one-sided AFFIRMATIVE restatement of the question as a claim. It takes the "yes" side

  78. [78]

    No small set of domains should dominate the batch

    Cover DIVERSE domains -- technology, health/medicine, education, economy/business, law/policy, culture/society, science, ethics, urban planning, international relations, sports, history, etc. No small set of domains should dominate the batch

  79. [79]

    Questions should be specific and concrete, not vague platitudes

  80. [80]

    Each pair must be UNIQUE -- do not repeat or closely paraphrase the examples or other generated pairs
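    Items 75–80 state rules for generating (question, argument) pairs: a debatable yes/no question, a one-sided affirmative argument, diverse domains, and no duplicates. A minimal sketch of a mechanical validity check under those rules; the function name and thresholds are assumptions, and the judgment-based rules (moderate polarization, natural phrasing) are left to human or model review:

    ```python
    def is_valid_pair(question: str, argument: str, seen: set[str]) -> bool:
        """Check one generated pair against the mechanically checkable rules:
        yes/no question form, substantive affirmative argument, uniqueness."""
        q = question.strip()
        ok = (
            q.endswith("?")                          # must be phrased as a question
            and q.lower().split()[0] in {            # yes/no question openers
                "is", "are", "should", "does", "do", "can", "will", "would"}
            and len(argument.split()) > 3            # a substantive affirmative claim
            and q.lower() not in seen                # each pair must be unique
        )
        if ok:
            seen.add(q.lower())
        return ok

    seen: set[str] = set()
    assert is_valid_pair(
        "Should cities prioritize bike lanes over car lanes?",
        "Cities should prioritize bike lanes over car lanes.",
        seen,
    )
    # A repeat of the same question is rejected as a duplicate.
    assert not is_valid_pair(
        "Should cities prioritize bike lanes over car lanes?",
        "Cities should prioritize bike lanes over car lanes.",
        seen,
    )
    ```

    The duplicate check uses lowercased exact matching; catching close paraphrases, as the rule requires, would need a fuzzier comparison.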

Showing first 80 references.