pith. machine review for the scientific record.

arxiv: 2605.09294 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Towards Effective Theory of LLMs: A Representation Learning Approach

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:37 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords representational effective theory · LLM interpretability · hidden state trajectories · self-supervised learning · macrostate learning · sycophancy prediction · causal intervention

The pith

LLM hidden-state trajectories can be coarse-grained into macrostates that support reasoning interpretation, behavior prediction, and generation steering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Representational Effective Theory (RET) to describe large language model computation using learned macrostates derived from hidden-state trajectories. These macrostates are obtained through a self-supervised objective similar to BYOL or JEPA, which coarse-grains activations while preserving higher-level semantic and causal structures. Evaluation shows that these states are temporally consistent, reveal mental-state trajectories during reasoning, capture semantic information, allow early prediction of issues like sycophancy, and offer ways to intervene and steer the model's outputs toward desired phases. A sympathetic reader would care because this framework provides a practical way to move beyond microscopic details toward higher-level, effective descriptions that aid in understanding and controlling LLMs.

Core claim

RET learns macrovariables from LLM hidden-state trajectories using a self-supervised objective in the style of BYOL and JEPA. These macrovariables coarse-grain the activations into higher-level variables that preserve structure relevant for prediction and interpretation. The resulting states are temporally consistent, capture high-level semantics, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases.

What carries the argument

Representational Effective Theory (RET), a framework that learns macrostates from hidden-state trajectories via self-supervised coarse-graining to create dynamically meaningful variables.
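The coarse-graining step can be sketched as a BYOL/JEPA-style temporal self-prediction loss: an online encoder maps hidden state h_t to a macrostate z_t, a predictor tries to match the exponential-moving-average target encoder's embedding of h_{t+1}. Everything below (the linear encoder, widths, momentum value) is an assumed stand-in for illustration, not the paper's architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(W, h):
    # Toy "encoder" standing in for the macrostate map f: h_t -> z_t.
    return np.tanh(h @ W)

def cosine_loss(p, z):
    # BYOL-style loss: 1 - cosine similarity between prediction and target.
    p = p / np.linalg.norm(p, axis=-1, keepdims=True)
    z = z / np.linalg.norm(z, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(p * z, axis=-1)))

d_hidden, d_macro = 32, 4
W_online = rng.normal(scale=0.1, size=(d_hidden, d_macro))  # trained by SGD in practice
W_target = W_online.copy()                                  # EMA copy, no gradients
W_pred = np.eye(d_macro)                                    # predictor q: z_t -> z_{t+1}

# A toy hidden-state trajectory (T tokens x d_hidden), standing in for one layer's activations.
traj = rng.normal(size=(16, d_hidden))

z_online = encode(W_online, traj[:-1])   # z_t from the online branch
z_target = encode(W_target, traj[1:])    # stop-gradient target z_{t+1}
loss = cosine_loss(z_online @ W_pred, z_target)

# EMA update of the target encoder (momentum tau), as in BYOL.
tau = 0.99
W_target = tau * W_target + (1 - tau) * W_online
```

The loss rewards macrostates whose dynamics are nearly closed: z_{t+1} should be predictable from z_t alone, which is exactly what the self-prediction R² evaluation measures.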

If this is right

  • The macrostates reveal mental-state trajectories of reasoning in LLMs.
  • They capture high-level semantic structure in the computation.
  • They enable early prediction of behavioral outcomes like sycophancy.
  • They provide causal handles for intervening on and steering generations.
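The last bullet can be made concrete with a minimal steering sketch: an attractor nudges a hidden state toward a direction associated with a macrostate cluster, a repulsor nudges it away. The linear nudge, `alpha`, and the centroid stand-in are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def steer(h, centroid, alpha=0.5, repel=False):
    """Nudge a hidden state toward (attractor) or away from (repulsor)
    a cluster direction mapped into hidden-state space.
    The linear interpolation and alpha are illustrative choices."""
    direction = centroid - h
    if repel:
        direction = -direction
    return h + alpha * direction

rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=d)
c35 = rng.normal(size=d)  # stand-in for a decoded cluster-C35 direction

h_attract = steer(h, c35, alpha=0.5)
h_repel = steer(h, c35, alpha=0.5, repel=True)

# Attraction reduces the distance to the centroid; repulsion increases it.
print(np.linalg.norm(h_attract - c35) < np.linalg.norm(h - c35))
print(np.linalg.norm(h_repel - c35) > np.linalg.norm(h - c35))
```

In a real model the nudge would be applied at each decoding step via a forward hook on the chosen layer; here only the arithmetic is shown.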

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If RET macrostates prove robust across models, they could inform the design of architectures that build in interpretability from the start.
  • The same coarse-graining approach might extend to vision or multimodal models to extract effective descriptions of their computations.
  • Comparing RET macrostates before and after fine-tuning could test how training alters the high-level dynamics of the model.

Load-bearing premise

The self-supervised objective applied to hidden-state trajectories preserves the higher-level semantic and causal structure needed for downstream prediction and intervention.

What would settle it

Observing that interventions based on the RET macrostates do not alter the model's generation behavior in a controlled manner, while direct interventions on raw hidden states do, would indicate that the macrovariables lack the claimed causal utility.
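One hedged way to score such a settling experiment: perturb along the candidate macrostate direction and along matched-norm random directions, then compare the induced shift in a behavioral readout. The linear readout, the toy correlation between direction and readout, and the effect metric are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
readout = rng.normal(size=d)                    # stand-in behavioral readout w: h -> logit
ret_dir = readout + 0.1 * rng.normal(size=d)    # toy macrostate direction aligned with behavior
ret_dir /= np.linalg.norm(ret_dir)

def effect(h, direction, eps=1.0):
    # Behavioral shift caused by a norm-eps perturbation along `direction`.
    return abs((h + eps * direction) @ readout - h @ readout)

h = rng.normal(size=d)
ret_effect = effect(h, ret_dir)
rand_effects = [effect(h, (v := rng.normal(size=d)) / np.linalg.norm(v))
                for _ in range(200)]

# A causally useful macrovariable direction should move the readout far more
# than matched-norm random perturbations; a null result here would be damning.
print(ret_effect > np.mean(rand_effects))
```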

Figures

Figures reproduced from arXiv: 2605.09294 by Guannan Qu, Muhammed Ustaomeroglu.

Figure 1: A representative held-out NuminaMath trajectory for GPT-OSS-20B under … view at source ↗
Figure 2: Self-prediction R² across representations. For each model–dataset pair, we fit predictors of the next representation from the current representation and report the best held-out R² over a matched predictor-capacity sweep. RET achieves the highest score in every setting, indicating more nearly closed macro-dynamics than raw hidden states, PCA baselines, or SAE features. Full protocol and predictor network s… view at source ↗
Figure 3: Temporal consistency of the RET macrostate z_t versus baselines. We compare raw hidden states h_t, a pooled baseline h_t^pooled (W=4), same-layer SAE features, and z_t on a Pythia-160M generation with two planted scene changes. Top row: UMAP projections of token-level trajectories, with consecutive tokens connected, planted boundaries marked by squares, and occurrences of “and” circled. Titles report tortuosi… view at source ↗
Figure 4: Compact false-presupposition example. The final turn is labeled sycophantic because the model endorses the false premise. (Bar chart of balanced accuracy for Linear and DoM probes over token, cum-mean, and turn-mean features, a Transformer block, and RET variants, ranging 0.53–0.80, with RET (Supervised) highest at 0.80.) … view at source ↗
Figure 6: Steering toward cluster C35 (“known formula re… view at source ↗
Figure 7: Predictor-capacity sweep. Each panel corresponds to one backbone–dataset pair, and each curve varies the hidden width of the two-layer MLP predictor used to predict the next representation from the current one. RET remains best across the full capacity sweep. Moreover, performance saturates as predictor width increases, indicating that the self-prediction advantage in … view at source ↗
Figure 8: Full RET macrostate trajectory for the held-out NuminaMath sample from … view at source ↗
Figure 9: Independent GPT-5.4 Thinking model narration of the same NuminaMath sample, aligned with the RET group sequence from … view at source ↗
Figure 10: Fourth random held-out NuminaMath sample (not cherry-picked), shown across all five … view at source ↗
Figure 11: Second random held-out NuminaMath sample, same RET clustering as Figure 1 … view at source ↗
Figure 12: Third random held-out NuminaMath sample, same RET clustering as Figure 1 … view at source ↗
Figure 13: Baseline: same clustering-and-naming pipeline applied to raw GPT-OSS-20B layer-11 … view at source ↗
Figure 14: Baseline: same clustering-and-naming pipeline applied to PCA-reduced hidden states … view at source ↗
Figure 15: Baseline: same clustering-and-naming pipeline applied to a same-layer sparse autoencoder … view at source ↗
Figure 16: Baseline: each token colored by its most-active SAE latent (argmax, no clustering), using … view at source ↗
Figure 17: Temporal consistency on an abrupt-change prompt with the scene sequence … view at source ↗
Figure 18: Temporal consistency on an abrupt-change prompt with the scene sequence … view at source ↗
Figure 19: Temporal consistency on a randomly drawn TinyStories narrative (sample 00) with no … view at source ↗
Figure 20: Temporal consistency on a randomly drawn TinyStories narrative (sample 01) with no … view at source ↗
Figure 21: Temporal consistency on a randomly drawn TinyStories narrative (sample 02) with no … view at source ↗
Figure 22: Semantic versus syntactic organization on MMLU. Token-level t-SNE of Pythia-160M layer-6 representations: h, h_PCA, SAE features, and RET. The same 10,000 tokens are shown in each panel. Points are colored by MMLU subject group in the top row and by Universal POS tag in the bottom row. Baselines cluster mainly by syntax, whereas RET clusters mainly by semantic subject. … view at source ↗
Figure 23: Early-position sycophancy prediction (Qwen2.5-14B-Instruct), companion to Figure 5. Same evaluation as in the main text but on Qwen2.5-14B-Instruct (probes use layer-23 hidden states); scripted pushback messages come from the seed (the four FP escalation turns or the repeated debate-disagreement prompt). False-presupposition example: the false presupposition is that swimming immediately after eating causes danger… view at source ↗
Figure 24: Repulsor away from C32 (“factor exponent bookkeeping”, group G9) on a sum-to-11 … view at source ↗
Figure 25: Steering toward C40 (“diagram sanity checks”, group G5) on a triangle-area problem. … view at source ↗
Figure 26: Steering fails on ∑_{i=1}^{61} i. All three generations — baseline, attractor toward C32 (“factor exponent bookkeeping”, G9), and repulsor away from C6 (“telescoping term patterns”, G11) — apply the closed-form formula n(n + 1)/2 = 1891. Neither intervention can induce manually summing 61 integers … view at source ↗
Figure 27: Steering toward C18 (“verification closure”, G1) on a held-out NuminaMath problem. … view at source ↗
Figure 28: Steering toward C20 (“analysis entry stub”, G4) on the same problem. The steered model … view at source ↗
Figure 29: Steering toward C27 (“boxed choice conclusions”, G2) on the same problem. The steered … view at source ↗
Figure 30: Steering toward C57 (“wording disambiguation”, group G7) on an ambiguous word prob… view at source ↗
read the original abstract

We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental-state" trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high-level, dynamically meaningful variables that support interpretation, prediction, and intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes Representational Effective Theory (RET), a framework that learns macrostates from LLM hidden-state trajectories via a BYOL/JEPA-style self-supervised objective. These macrostates are claimed to yield temporally consistent representations that capture high-level semantic structure, enable early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering model generations toward interpretable computational phases.

Significance. If the empirical claims hold with rigorous controls, RET would supply a concrete representation-learning route to effective theories of LLM computation, moving beyond post-hoc mechanistic interpretability toward dynamically meaningful, intervenable macrovariables. The approach is novel in its direct application of view-invariance objectives to activation trajectories for downstream prediction and steering tasks.

major comments (3)
  1. [Abstract] Abstract: the abstract asserts that RET 'supports early prediction of behavioral outcomes such as sycophancy' and 'provide[s] causal handles for steering generations,' yet supplies no quantitative metrics, baselines, ablation studies, data-exclusion criteria, or statistical significance tests. Without these, the central claim that the learned macrostates are 'practically relevant for interpretability' cannot be evaluated.
  2. [Abstract] Abstract / proposed method: the claim that the BYOL/JEPA-style objective extracts macrostates preserving 'higher-level causal structure' usable for intervention rests on an unverified assumption. The objective enforces view-invariance and temporal consistency but does not incorporate do-calculus, counterfactuals, or explicit causal modeling; reported steering results could arise from correlational patterns alone. A concrete test (e.g., comparison against non-causal baselines or explicit intervention ablations) is required to substantiate the 'effective theory' interpretation.
  3. [Abstract] Abstract: the manuscript introduces new evaluation tasks and a new objective rather than deriving predictions from previously fitted quantities, raising the risk that any reported 'predictions' are tautological with the learned parameters. The absence of details on how macrostates are validated as causally relevant (versus merely predictive) makes the circularity concern load-bearing for the intervention claims.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'mental-state trajectories of reasoning' is used without a precise operational definition or link to the learned macrovariables; this should be clarified with an explicit mapping or example.
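One explicit mapping of the kind this comment asks for is nearest-centroid assignment: each token's macrovariable is labeled with its closest cluster, and the label sequence is read as the "mental-state trajectory". The cluster names and coordinates below are invented for illustration; they are not the paper's clusters.

```python
import numpy as np

def phase_trajectory(z_seq, centroids, names):
    """Label each macrostate z_t with its nearest cluster centroid.
    The resulting label sequence is one concrete operationalization
    of a 'mental-state trajectory' (names are illustrative)."""
    dists = np.linalg.norm(z_seq[:, None, :] - centroids[None, :, :], axis=-1)
    return [names[i] for i in dists.argmin(axis=1)]

names = ["setup", "formula recall", "verification"]
centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z_seq = np.array([[0.1, 0.0], [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])

print(phase_trajectory(z_seq, centroids, names))
# → ['setup', 'formula recall', 'formula recall', 'verification']
```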

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive review. We address each major comment point by point below, providing clarifications based on the manuscript content and indicating where revisions will strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the abstract asserts that RET 'supports early prediction of behavioral outcomes such as sycophancy' and 'provide[s] causal handles for steering generations,' yet supplies no quantitative metrics, baselines, ablation studies, data-exclusion criteria, or statistical significance tests. Without these, the central claim that the learned macrostates are 'practically relevant for interpretability' cannot be evaluated.

    Authors: We agree that the abstract, constrained by length, omits specific quantitative details. The full manuscript reports concrete metrics for early sycophancy prediction (including layer-wise accuracies and AUC values against linear probe baselines on raw activations), ablation studies removing temporal consistency or view-invariance terms, data exclusion criteria for the behavioral datasets, and statistical significance testing via bootstrap resampling. We will revise the abstract to incorporate one or two key quantitative highlights (e.g., early-layer prediction performance) to better ground the claims. revision: yes

  2. Referee: [Abstract] Abstract / proposed method: the claim that the BYOL/JEPA-style objective extracts macrostates preserving 'higher-level causal structure' usable for intervention rests on an unverified assumption. The objective enforces view-invariance and temporal consistency but does not incorporate do-calculus, counterfactuals, or explicit causal modeling; reported steering results could arise from correlational patterns alone. A concrete test (e.g., comparison against non-causal baselines or explicit intervention ablations) is required to substantiate the 'effective theory' interpretation.

    Authors: We acknowledge that the self-supervised objective relies on view-invariance and temporal consistency rather than explicit causal machinery such as do-calculus. The 'causal handles' claim is supported empirically by the steering experiments, in which replacing a trajectory segment with a macrostate from a different semantic phase produces predictable shifts in generation behavior that align with the macrostate semantics and are not observed under random or correlational perturbations. To address the concern directly, we will add a dedicated discussion of the distinction between the learned invariance and causal claims, plus new ablations comparing steering success against non-causal baselines (e.g., PCA or random projections of activations). revision: yes

  3. Referee: [Abstract] Abstract: the manuscript introduces new evaluation tasks and a new objective rather than deriving predictions from previously fitted quantities, raising the risk that any reported 'predictions' are tautological with the learned parameters. The absence of details on how macrostates are validated as causally relevant (versus merely predictive) makes the circularity concern load-bearing for the intervention claims.

    Authors: The macrostates are learned via a self-supervised objective applied solely to unlabeled activation trajectories; the downstream prediction and intervention tasks use held-out behavioral labels and separate test trajectories that were never seen during macrostate learning. This separation ensures the reported predictions are not tautological. For causal relevance, the intervention protocol demonstrates that forcing a macrostate transition alters subsequent generations in a manner consistent with the macrostate semantics and independent of the original activation path. We will expand the methods and results sections with explicit diagrams and text clarifying the train/test separation and the validation criteria distinguishing predictive from interventional utility. revision: yes
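The claimed separation can be sketched end to end: representations are fit on unlabeled training trajectories only, and behavioral labels appear only when probing held-out trajectories. The PCA-like stand-in for macrostate learning, the toy label, and the split sizes are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy protocol: 100 trajectories' pooled features and a stand-in
# behavioral label (e.g. "sycophantic"); labels never touch the
# representation-learning step.
trajs = rng.normal(size=(100, 16))
labels = (trajs[:, 0] > 0).astype(int)

train, test = np.arange(80), np.arange(80, 100)

# "Learn macrostates" on train inputs only (here: a PCA-like projection
# standing in for the self-supervised objective).
mu = trajs[train].mean(axis=0)
_, _, vt = np.linalg.svd(trajs[train] - mu, full_matrices=False)
proj = vt[:4].T  # 4 macro-dimensions

z_train = (trajs[train] - mu) @ proj
z_test = (trajs[test] - mu) @ proj

# Fit a linear probe on train labels, evaluate on never-seen trajectories;
# any predictive power is then not tautological with the fitted parameters.
w, *_ = np.linalg.lstsq(np.c_[z_train, np.ones(80)], labels[train], rcond=None)
pred = (np.c_[z_test, np.ones(20)] @ w > 0.5).astype(int)
acc = float((pred == labels[test]).mean())
```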

Circularity Check

0 steps flagged

No significant circularity; new framework with empirical evaluations

full rationale

The paper introduces RET as a novel framework that applies a BYOL/JEPA-style self-supervised objective to learn macrostates from LLM hidden-state trajectories, then evaluates them empirically on tasks including temporal consistency, semantic capture, early prediction of outcomes like sycophancy, and steering interventions. No derivation chain is presented in which a claimed prediction or result reduces by construction to the inputs or fitted parameters (no equations shown that equate outputs to definitions). No self-citations are invoked as load-bearing for uniqueness or ansatzes, and no known results are merely renamed. The central claims rest on the new objective and downstream evaluations rather than tautological reductions, consistent with the reader's assessment of low circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on one domain assumption about the self-supervised objective and introduces two new conceptual entities; no free parameters are mentioned in the abstract.

axioms (1)
  • domain assumption The self-supervised objective on hidden-state trajectories preserves higher-level semantic and causal structure relevant for prediction and intervention.
    This assumption is required for the learned macrostates to be practically useful beyond the training objective itself.
invented entities (2)
  • Representational Effective Theory (RET) no independent evidence
    purpose: Framework for describing LLM computation via learned macrostates rather than microscopic activations.
    Newly introduced in the paper as the organizing concept.
  • macrostates / macrovariables no independent evidence
    purpose: Coarse-grained variables extracted from hidden-state trajectories that capture temporally consistent semantic and dynamic structure.
    Defined and learned within the proposed method; no external validation or independent existence proof is provided in the abstract.

pith-pipeline@v0.9.0 · 5425 in / 1498 out tokens · 47212 ms · 2026-05-12T03:37:18.798344+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · 10 internal anchors

  1. [1]

    Understanding intermediate layers using linear classifier probes

    Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016

  2. [2]

    More is different

    P. W. Anderson. More is different. Science, 177(4047):393–396, 1972. doi: 10.1126/science.177.4047.393. URL https://www.science.org/doi/abs/10.1126/science.177.4047.393

  3. [3]

    SAEs for GPT-OSS-20B

    Andy Arditi. SAEs for GPT-OSS-20B. https://huggingface.co/andyrdt/saes-gpt-oss-20b, 2026. Hugging Face model repository. Accessed: 2026-05-02

  4. [4]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37:136037–136083, 2024

  5. [5]

    Language models can predict their own behavior

    Dhananjay Ashok and Jonathan May. Language models can predict their own behavior. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=i8IqEzpHaJ

  6. [6]

    Self-supervised learning from images with a joint- embedding predictive architecture

    Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15619–15629, 2023

  7. [7]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  8. [8]

    TD-JEPA: Latent-predictive representations for zero-shot reinforcement learning

    Marco Bagatella, Matteo Pirotta, Ahmed Touati, Alessandro Lazaric, and Andrea Tirinzoni. Td-jepa: Latent-predictive representations for zero-shot reinforcement learning. arXiv preprint arXiv:2510.00739, 2025

  9. [9]

    V-jepa: Latent video prediction for visual representation learning

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mido Assran, and Nicolas Ballas. V-jepa: Latent video prediction for visual representation learning. OpenReview, 2023

  10. [10]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  11. [11]

    Minimal model explanations

    Robert W. Batterman and Collin C. Rice. Minimal model explanations. Philosophy of Science, 81(3):349–376, 2014. doi: 10.1086/676677

  12. [12]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Nora Belrose, Igor Ostrovsky, Lev McKinney, Zach Furman, Logan Smith, Danny Halawi, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023

  13. [13]

    Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability

    Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio P Calmon. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. arXiv preprint arXiv:2511.05541, 2025

  14. [14]

    Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, and Flavio P. Calmon. Temporal sparse autoencoders: Leveraging the sequential nature of language for interpretability. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=bojVI4l9Kn

  15. [15]

    Pythia: a suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: a suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, pages 2397–2430, 2023

  16. [16]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

  17. [17]

    Discovering latent knowledge in language models without supervision

    Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=ETKGuby0hcs

  19. [19]

    Can we predict alignment before models finish thinking? Towards monitoring misaligned reasoning models

    Yik Siu Chan, Zheng-Xin Yong, and Stephen H Bach. Can we predict alignment before models finish thinking? towards monitoring misaligned reasoning models. arXiv preprint arXiv:2507.12428, 2025

  20. [20]

    Towards automated circuit discovery for mechanistic interpretability

    Arthur Conmy, Augustine Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36:16318–16352, 2023

  21. [21]

    The origins of computational mechanics: A brief intellectual history and several clarifications

    James P Crutchfield. The origins of computational mechanics: A brief intellectual history and several clarifications. arXiv preprint arXiv:1710.06832, 2017

  22. [22]

    Inferring statistical complexity

    James P. Crutchfield and Karl Young. Inferring statistical complexity. Phys. Rev. Lett., 63:105–108, Jul 1989. doi: 10.1103/PhysRevLett.63.105. URL https://link.aps.org/doi/10.1103/PhysRevLett.63.105

  23. [23]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  24. [24]

    Eleutherai/sae-pythia-160m-32k

    EleutherAI. Eleutherai/sae-pythia-160m-32k. Hugging Face model card, 2026. Accessed April 19, 2026

  25. [25]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html

  27. [27]

    Prediction, retrodiction, and the amount of information stored in the present

    Christopher J Ellison, John R Mahoney, and James P Crutchfield. Prediction, retrodiction, and the amount of information stored in the present. Journal of Statistical Physics, 136(6):1005–1034, 2009

  28. [28]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. In The Thirteenth International Conference on Learning Representations, 2024

  29. [29]

    Transformer feed-forward layers are key-value memories

    Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021

  30. [30]

    Dissecting recall of factual associations in auto-regressive language models

    Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023

  31. [31]

    Bootstrap your own latent: A new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In Neural Information Processing Systems, 2020

  32. [32]

    Linearity of relation decoding in transformer language models

    Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. arXiv preprint arXiv:2308.09124, 2023

  33. [33]

    Measuring sycophancy of language models in multi-turn dialogues

    Jiseung Hong, Grace Byun, Seungone Kim, and Kai Shu. Measuring sycophancy of language models in multi-turn dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 2239–2259. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.121. URL http://dx.doi.org/10.18653/v1/2025.findings-emnlp.121

  34. [34]

    Llm-jepa: Large language models meet joint embedding predictive architectures

    Hai Huang, Yann LeCun, and Randall Balestriero. Llm-jepa: Large language models meet joint embedding predictive architectures. InNeurIPS 2025 Fourth Workshop on Deep Learning for Code, 2025

  35. [35]

    Complexity Science: The Study of Emergence

    Henrik Jeldtoft Jensen. Complexity Science: The Study of Emergence. Cambridge University Press, 2022

  36. [36]

    Prototype-based dynamic steering for large language models

    Ceyhun Efe Kayan and Li Zhang. Prototype-based dynamic steering for large language models. arXiv preprint arXiv:2510.05498, 2025

  37. [37]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022

  38. [38]

    Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

    Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. Hugging Face repository, 13(9):9, 2024

  39. [39]

    Priors in time: Missing inductive biases for language model interpretability

    Ekdeep Singh Lubana, Can Rager, Sai Sumedh R. Hindupur, Valerie Costa, Greta Tuckute, Oam Patel, Sonia Krishna Murthy, Thomas Fel, Daniel Wurgaft, Eric J. Bigelow, Johnny Lin, Demba Ba, Martin Wattenberg, Fernanda Viegas, Melanie Weber, and Aaron Mueller. Priors in time: Missing inductive biases for language model interpretability. arXiv preprint arXiv:2511.01836, 2025. URL https://arxiv...

  40. [40]

    Umap: Uniform manifold approximation and projection

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Großberger. Umap: Uniform manifold approximation and projection. Journal of Open Source Software, 3(29), 2018

  41. [41]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 17359–17372, 2022

  42. [42]

    Zoom in: An introduction to circuits

    Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in

  43. [43]

    In-context learning and induction heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, a...

  44. [44]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Panickssery, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. arXiv preprint arXiv:2312.06681, 2023

  45. [45]

    Qwen3.5-Omni Technical Report

    Qwen. Qwen3.5-omni technical report, 2026. URL https://arxiv.org/abs/2604.15804

  46. [46]

    Qwen2.5 Technical Report

    Qwen: An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  47. [47]

    Improving dictionary learning with gated sparse autoencoders

    Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, and Neel Nanda. Improving dictionary learning with gated sparse autoencoders, 2024. URL https://arxiv.org/abs/2404.16014

  48. [48]

    Software in the natural world: A computational approach to hierarchical emergence

    Fernando E Rosas, Bernhard C Geiger, Andrea I Luppi, Anil K Seth, Daniel Polani, Michael Gastpar, and Pedro AM Mediano. Software in the natural world: A computational approach to hierarchical emergence. arXiv preprint arXiv:2402.09090, 2024

  49. [49]

    On principles of emergent organization

    Adam Rupe and James P. Crutchfield. On principles of emergent organization. Physics Reports, 1071:1–47, 2024. ISSN 0370-1573. doi: https://doi.org/10.1016/j.physrep.2024.04.001. URL https://www.sciencedirect.com/science/article/pii/S0370157324001327

  50. [50]

    The determination of relative path length as a measure for tortuosity in compacts using image analysis

    Yu San Wu, Lucas J van Vliet, Henderik W Frijlink, and Kees van der Voort Maarschalk. The determination of relative path length as a measure for tortuosity in compacts using image analysis. European Journal of Pharmaceutical Sciences, 28(5):433–440, 2006

  51. [51]

    Computational mechanics: Pattern and prediction, structure and simplicity

    Cosma Rohilla Shalizi and James P Crutchfield. Computational mechanics: Pattern and prediction, structure and simplicity. Journal of Statistical Physics, 104(3):817–879, 2001

  52. [52]

    What is a macrostate? Subjective observations and objective dynamics

    Cosma Rohilla Shalizi and Cristopher Moore. What is a macrostate? Subjective observations and objective dynamics. Foundations of Physics, 55(1), December 2024. ISSN 1572-9516. doi: 10.1007/s10701-024-00814-1. URL http://dx.doi.org/10.1007/s10701-024-00814-1

  53. [53]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, et al. Towards understanding sycophancy in language models. In The Twelfth International Conference on Learning Representations, 2023

  54. [54]

    An autonomous debating system

    Noam Slonim, Yonatan Bilu, Carlos Alzate, Roy Bar-Haim, Ben Bogin, Francesca Bonin, Leshem Choshen, Edo Cohen-Karlik, Lena Dankin, Lilach Edelstein, et al. An autonomous debating system. Nature, 591(7850):379–384, 2021

  55. [55]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  56. [56]

    Scaling monosemanticity

    Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

  57. [57]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering. arXiv preprint arXiv:2308.10248, 2023

  58. [58]

    Internal planning in language models: Characterizing horizon and branch awareness

    Muhammed Ustaomeroglu, Baris Askin, Gauri Joshi, Carlee Joe-Wong, and Guannan Qu. Internal planning in language models: Characterizing horizon and branch awareness. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=dqGWQdFdTC

  59. [59]

    Interpretability in the wild: a circuit for indirect object identification in GPT-2 small

    Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=NpsVSN6o4ul

  60. [60]

    When truth is overridden: Uncovering the internal origins of sycophancy in large language models

    Keyu Wang, Jin Li, Shu Yang, Zhuoran Zhang, and Di Wang. When truth is overridden: Uncovering the internal origins of sycophancy in large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33566–33574, 2026

  61. [61]

    Adaptive activation steering: A tuning-free llm truthfulness improvement method for diverse hallucinations categories

    Tianlong Wang, Xianfeng Jiao, Yinghao Zhu, Zhongzhi Chen, Yifan He, Xu Chu, Junyi Gao, Yasha Wang, and Liantao Ma. Adaptive activation steering: A tuning-free LLM truthfulness improvement method for diverse hallucinations categories. In Proceedings of the ACM on Web Conference 2025, pages 2562–2578, 2025

  62. [62]

    Effective Theories in Physics: From Planetary Orbits to Elementary Particle Masses

    James D. Wells. Effective Theories in Physics: From Planetary Orbits to Elementary Particle Masses. SpringerBriefs in Physics. Springer, 2012. ISBN 978-3-642-34891-4. doi: 10.1007/978-3-642-34892-1

  63. [63]

    Crepe: Open-domain question answering with false presuppositions

    Xinyan Yu, Sewon Min, Luke Zettlemoyer, and Hannaneh Hajishirzi. Crepe: Open-domain question answering with false presuppositions. In ACL, 2023

  64. [64]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023

    A Related Work

    Effective theories and predictive representations. RET is motivated by the broader idea ...

  65. [65]

    Name each GROUP: infer the overarching linguistic/semantic theme shared by its constituent clusters

  66. [66]

    Wait, we already added 6... We want >6

    Name each CLUSTER: infer the specific sub-function within its group. Cluster names must be CONTRASTIVE.

    DATA FORMAT
    ----------------------------------------
    Section 2 shows the group structure table with aggregate statistics. Section 3 shows evidence per group. Each snippet is a SEGMENT -- a maximal run of consecutive tokens all assigned to that cluster. ...

  67. [67]

    The user asking doesn't realize the assumption is wrong

    question: a natural-sounding question that embeds a false assumption. The user asking doesn't realize the assumption is wrong

  68. [68]

    presupposition: the specific false assumption in the question

  69. [69]

    correction: the factual truth that contradicts the presupposition

  70. [70]

    pushback_1: user expresses confusion about the AI's correction

  71. [71]

    pushback_2: user reasserts the false claim with more confidence

  72. [72]

    pushback_3: user provides anecdotal evidence supporting the false claim

  73. [73]

    pushback_4: user directly questions the AI's credibility

    Rules:
    - Questions must sound NATURAL -- like a real person genuinely asking
    - The false presupposition should be SUBTLE, not obviously wrong
    - Cover DIVERSE topics: science, technology, history, geography, biology, culture, language, everyday life, health, economics, etc.
    - Pushbacks must be cohere...

  74. [74]

    What actually happens when we run out of IPv4 Addresses?

    question: "What actually happens when we run out of IPv4 addresses?"
    presupposition: "No plan has been made for the need of additional IP addresses past IPv4."
    correction: "IPv6 has already been implemented to address the issue."
    pushback_1: "I don't get it. I thought we were almost out of IPv4 addresses."
    pushback_2: "But I've heard that we were alrea...
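    The fields listed in items 67–73 (question, presupposition, correction, pushback_1 through pushback_4) describe one record of the false-presupposition dialogue data. A minimal sketch of that record shape, assuming the field names from the fragments above; the class name and the pushback_3/pushback_4 strings are illustrative placeholders, not the paper's text:

    ```python
    from dataclasses import dataclass, asdict

    @dataclass
    class FalsePresuppositionDialogue:
        """One record: a question embedding a subtle false assumption,
        its correction, and four escalating user pushbacks."""
        question: str        # natural-sounding question with a false assumption
        presupposition: str  # the specific false assumption in the question
        correction: str      # factual truth contradicting the presupposition
        pushback_1: str      # user expresses confusion about the correction
        pushback_2: str      # user reasserts the false claim more confidently
        pushback_3: str      # user offers anecdotal support for the false claim
        pushback_4: str      # user directly questions the AI's credibility

    record = FalsePresuppositionDialogue(
        question="What actually happens when we run out of IPv4 addresses?",
        presupposition="No plan has been made for addresses past IPv4.",
        correction="IPv6 has already been implemented to address the issue.",
        pushback_1="I don't get it. I thought we were almost out of IPv4 addresses.",
        pushback_2="But I've heard that we were already out.",
        pushback_3="My ISP told me they ran out of addresses last year.",  # placeholder
        pushback_4="Are you sure you actually know how IP allocation works?",  # placeholder
    )

    # The record serializes to a plain dict with exactly the seven schema fields.
    assert set(asdict(record)) == {
        "question", "presupposition", "correction",
        "pushback_1", "pushback_2", "pushback_3", "pushback_4",
    }
    ```

    A dataclass makes the schema explicit and lets `asdict` produce the flat key/value form the generation prompt describes.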

  75. [75]

    Each question must be a genuinely debatable yes/no question where reasonable people disagree

  76. [76]

    Is murder wrong?

    Polarization must be MODERATE -- avoid trivially one-sided topics (e.g., "Is murder wrong?") and extremely polarizing topics (e.g., "Is Adolf Hitler a good person?")

  77. [77]

    It takes the "yes" side

    The argument is a one-sided AFFIRMATIVE restatement of the question as a claim. It takes the "yes" side

  78. [78]

    No small set of domains should dominate the batch

    Cover DIVERSE domains -- technology, health/medicine, education, economy/business, law/policy, culture/society, science, ethics, urban planning, international relations, sports, history, etc. No small set of domains should dominate the batch

  79. [79]

    Questions should be specific and concrete, not vague platitudes

  80. [80]

    Each pair must be UNIQUE -- do not repeat or closely paraphrase the examples or other generated pairs
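    Items 75–80 state rules for generating (question, argument) pairs: a debatable yes/no question, a one-sided affirmative argument, diverse domains, and no duplicates. A minimal sketch of a mechanical validity check under those rules; the function name and thresholds are assumptions, and the judgment-based rules (moderate polarization, natural phrasing) are left to human or model review:

    ```python
    def is_valid_pair(question: str, argument: str, seen: set[str]) -> bool:
        """Check one generated pair against the mechanically checkable rules:
        yes/no question form, substantive affirmative argument, uniqueness."""
        q = question.strip()
        ok = (
            q.endswith("?")                          # must be phrased as a question
            and q.lower().split()[0] in {            # yes/no question openers
                "is", "are", "should", "does", "do", "can", "will", "would"}
            and len(argument.split()) > 3            # a substantive affirmative claim
            and q.lower() not in seen                # each pair must be unique
        )
        if ok:
            seen.add(q.lower())
        return ok

    seen: set[str] = set()
    assert is_valid_pair(
        "Should cities prioritize bike lanes over car lanes?",
        "Cities should prioritize bike lanes over car lanes.",
        seen,
    )
    # A repeat of the same question is rejected as a duplicate.
    assert not is_valid_pair(
        "Should cities prioritize bike lanes over car lanes?",
        "Cities should prioritize bike lanes over car lanes.",
        seen,
    )
    ```

    The duplicate check uses lowercased exact matching; catching close paraphrases, as the rule requires, would need a fuzzier comparison.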

Showing first 80 references.