ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data
Pith reviewed 2026-05-10 05:51 UTC · model grok-4.3
The pith
Written constitutions induce recoverable latent geometry that recurs across language models and neural perturbation data even as local details shift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ATLAS tests local charts in hidden-state space whose tangent structure, occupancy distribution, and behavioural coupling are measured under system change. On Gemma the anchored source-local chart captures 310 of 320 reviewed source rows and all 84 score-flip rows, but compact exact-patch sufficiency does not close, so the exportable unit is the broader source-defined family. Freezing that family yields re-identification in Phi with AUC 0.984 and mean gap 5.50, plus support in ALM8 mouse data across 5/5 folds with mean held-out AUC 0.72 and mean fold gap 4.50. The correspondence is geometric recurrence under redistribution rather than coordinate identity, site identity, or target-side mediation.
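The abstract does not specify how these separations are scored. As a minimal, hypothetical sketch of what an AUC and a mean gap measure for a frozen geometric unit, suppose the family were summarized by a single unit direction `w` (a deliberate simplification; the paper's unit is a chart/family, not one vector) and hidden states were scored by projection onto it:

```python
import numpy as np

def auc_and_gap(pos, neg):
    """ROC AUC via the Mann-Whitney rank statistic, plus the mean score gap.
    (Ties ignored for brevity; fine for continuous scores.)"""
    scores = np.concatenate([pos, neg])
    ranks = scores.argsort().argsort() + 1  # 1-based ranks
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return u / (len(pos) * len(neg)), pos.mean() - neg.mean()

rng = np.random.default_rng(0)
w = rng.normal(size=64)
w /= np.linalg.norm(w)                        # hypothetical frozen family direction
h_pos = rng.normal(size=(100, 64)) + 3.0 * w  # constitution-conditioned hidden states
h_neg = rng.normal(size=(100, 64))            # baseline hidden states
auc, gap = auc_and_gap(h_pos @ w, h_neg @ w)
```

An AUC near 1 with a large mean gap, as reported for Phi, corresponds to near-complete separation of the two score distributions; the mouse-data AUC of 0.72 is a much weaker but still above-chance separation.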
What carries the argument
The source-defined family of hidden states: frozen after identification on the source model, it serves as the exportable unit whose re-identification across models and substrates is what demonstrates geometric recurrence under redistribution.
If this is right
- The exportable unit is the broader source-defined family because compact exact-patch sufficiency does not close.
- Nearby target-local signals can appear without source-faithful closure, providing the main boundary condition.
- Support holds across all 5 folds in held-out mouse data with consistent mean gaps.
- The detectable organisation remains while local coordinates, occupancy distributions, and behavioural couplings redistribute.
Where Pith is reading between the lines
- If recurrence holds, checking for the source-defined family could allow transferring or predicting constitution effects between models without full retraining.
- The method might enable direct comparison of how high-level rules alter representations in artificial and biological neural systems.
- Testing the family in additional model architectures or perturbation datasets would clarify whether the recurrence is general or specific to the chosen source and targets.
Load-bearing premise
The source-local chart and source-defined family identified in Gemma can be re-identified in an unadapted Phi model and mouse perturbation data as evidence of geometric recurrence rather than coincidence or post-hoc selection.
What would settle it
Observing that the source-defined family fails to separate relevant contrasts with high AUC in additional unadapted models or shows no consistent support beyond chance in new neural perturbation datasets.
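One natural operationalisation of "no consistent support beyond chance" is a permutation null: shuffle condition labels and ask how often the shuffled gap matches the observed one. This is an illustrative test of our own construction, not a procedure from the paper:

```python
import numpy as np

def perm_pvalue(scores_pos, scores_neg, n_perm=2000, seed=0):
    """One-sided permutation p-value for the observed mean score gap."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([scores_pos, scores_neg])
    obs = scores_pos.mean() - scores_neg.mean()
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any real label/score association
        gap = pooled[: len(scores_pos)].mean() - pooled[len(scores_pos):].mean()
        hits += gap >= obs
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(3)
# Synthetic scores with a genuine shift: the null should be rejected.
p = perm_pvalue(rng.normal(1.0, 1.0, 80), rng.normal(0.0, 1.0, 80))
```

A family that failed the settling test above would instead yield p-values scattered uniformly across folds and datasets.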
Original abstract
Constitution-conditioned post-training can be analysed as a structured perturbation of a model's learned representational geometry. We introduce ATLAS, a geometry-first program that traces constitution-induced hidden-state structure across charts, models, and substrates. Instead of treating the relevant unit as a single behaviour, neuron, vector, or patch, ATLAS tests a local chart whose tangent structure, occupancy distribution, and behavioural coupling can be measured under system change. On Gemma, the anchored source-local chart captures 310 / 320 reviewed source rows and all 84 / 84 reviewed score-flip rows, but compact exact-patch sufficiency does not close, so the exportable unit is the broader source-defined family. Freezing that family, we re-identify a target-local realisation in an unadapted Phi model, where the fully adjudicated confirmatory contrast separates with AUC 0.984 and mean gap 5.50. In held-out ALM8 mouse frontal-cortex perturbation data, the same source-defined family receives support across 5/5 folds, with mean held-out AUC 0.72 and mean fold gap 4.50. A multiple-choice analysis provides the main boundary: nearby target-local signals can appear without source-faithful closure. The resulting correspondence is not coordinate identity, site identity, or a target-side mediation theorem. It is geometric recurrence under redistribution: written constitutions can induce recoverable latent geometry whose organisation remains detectable across model and substrate changes while its local coordinates, occupancy, and behavioural expression shift.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ATLAS, a geometry-first framework for analyzing how written constitutions induce structured perturbations in the latent geometry of language models. Using Gemma as the source model, it identifies a local chart and a broader source-defined family that together capture 310/320 reviewed source rows and all 84/84 score-flip rows. Freezing this family, the authors report re-identification in an unadapted Phi model (AUC 0.984, mean gap 5.50) and support in held-out ALM8 mouse frontal-cortex perturbation data (mean AUC 0.72 across 5/5 folds). The central claim is geometric recurrence under redistribution: the organisation of constitution-induced latent structure remains detectable across model architectures and neural substrates, even as local coordinates, occupancy, and behavioural expression shift. A multiple-choice analysis is presented as the main boundary condition against nearby but non-faithful signals.
Significance. If the reported cross-domain re-identification holds under pre-specified procedures, the result would be a substantive contribution to mechanistic interpretability and alignment research. It would provide concrete evidence that constitutional post-training can induce recoverable geometric signatures that transfer beyond a single model family and even into biological perturbation data, moving beyond neuron- or vector-level analyses to chart- and family-level invariants. This could open new avenues for testing alignment robustness and for linking artificial and neural representational geometry.
Major comments (2)
- [Abstract] The re-identification procedure for the source-defined family in the unadapted Phi model and ALM8 mouse data is not specified (e.g., fixed thresholds, embedding similarity, or data-dependent optimization). Without pre-specification of chart selection criteria, tangent-structure measurement, occupancy metrics, or the exact matching rule, the reported AUC 0.984 and 5/5-fold support cannot be distinguished from post-hoc selection of a family that aligns with target signals, as the manuscript itself flags with the multiple-choice boundary condition.
- [Abstract] No methods, derivations, data details, exclusion criteria, or error bars are provided for the AUC values, mean gaps, or fold-wise results. The central claim that the source-local chart and family constitute an exportable unit rests on these quantities; their absence makes it impossible to evaluate robustness or rule out circularity in the family definition.
Minor comments (1)
- [Abstract] The phrasing 'compact exact-patch sufficiency does not close' is unclear without accompanying definitions or equations for patch sufficiency.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments correctly identify that the abstract, as a high-level summary, omits key procedural details needed to assess pre-specification and robustness. We will revise the manuscript to address this by expanding the abstract and ensuring the main text provides explicit descriptions of the methods.
Point-by-point responses
Referee: [Abstract] The re-identification procedure for the source-defined family in the unadapted Phi model and ALM8 mouse data is not specified (e.g., fixed thresholds, embedding similarity, or data-dependent optimization). Without pre-specification of chart selection criteria, tangent-structure measurement, occupancy metrics, or the exact matching rule, the reported AUC 0.984 and 5/5-fold support cannot be distinguished from post-hoc selection of a family that aligns with target signals, as the manuscript itself flags with the multiple-choice boundary condition.
Authors: We agree that the abstract does not explicitly state the re-identification procedure. The source-defined family is constructed exclusively from the Gemma source data using the local chart's tangent structure and occupancy distribution; this family is then frozen and applied to the target domains without further optimization. Re-identification relies on a pre-specified matching rule based on embedding similarity to the source family members. The multiple-choice analysis is included precisely to demonstrate that nearby but non-source-faithful signals do not produce the same separation. To eliminate any ambiguity about post-hoc selection, we will revise the abstract to state these pre-specification steps explicitly and reference the source-only definition of the family. revision: yes
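The protocol the authors describe, define the family on source data only, freeze it, then apply a pre-specified matching rule in the target, can be sketched schematically. Everything below is hypothetical (the real family is built from tangent structure and occupancy statistics, and `match_score` stands in for their unstated embedding-similarity rule); the only point illustrated is the ordering: no target data touches the family definition.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Source phase (Gemma-side, hypothetical) -------------------------------
src = rng.normal(size=(200, 32))          # stand-in source hidden states
anchors = src[:5]                         # stand-in for chart-derived members
family = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)  # frozen here

def match_score(h, family):
    """Pre-specified rule: max cosine similarity to any frozen family member."""
    h = h / np.linalg.norm(h)
    return float((family @ h).max())

# --- Target phase (Phi-side): apply the frozen rule, no refitting ----------
tgt_state = rng.normal(size=32)
s = match_score(tgt_state, family)
```

Because `family` is fixed before any target state is seen, a high separation on target contrasts cannot arise from refitting, which is the pre-specification property the referee asks to have stated explicitly.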
Referee: [Abstract] No methods, derivations, data details, exclusion criteria, or error bars are provided for the AUC values, mean gaps, or fold-wise results. The central claim that the source-local chart and family constitute an exportable unit rests on these quantities; their absence makes it impossible to evaluate robustness or rule out circularity in the family definition.
Authors: We agree that the abstract lacks these supporting details. The reported AUCs, mean gaps, and 5/5-fold results are computed from the frozen source-defined family applied to held-out target data, with the family definition fixed prior to any target evaluation to avoid circularity. In revision we will expand the abstract with a concise methods summary that includes the computation of AUC and gaps, the cross-validation procedure for the folds, and any exclusion criteria applied to the reviewed rows. The full manuscript will also supply the complete derivations, data descriptions, and error bars so that readers can directly assess robustness. revision: yes
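The fold-wise numbers could be produced by a procedure like the following sketch. The actual fold construction, support criterion, and exclusion rules are not given in the abstract; `fold_support`, its defaults, and the synthetic scores are our assumptions:

```python
import numpy as np

def fold_support(scores_pos, scores_neg, k=5, auc_floor=0.5, seed=0):
    """Split held-out scores into k disjoint folds; report how many folds
    exceed chance AUC, plus the mean fold AUC and mean fold gap."""
    rng = np.random.default_rng(seed)
    ip = rng.permutation(len(scores_pos))
    ineg = rng.permutation(len(scores_neg))
    aucs, gaps = [], []
    for f in range(k):
        p = scores_pos[ip[f::k]]
        n = scores_neg[ineg[f::k]]
        ranks = np.concatenate([p, n]).argsort().argsort() + 1  # rank-based AUC
        auc = (ranks[: len(p)].sum() - len(p) * (len(p) + 1) / 2) / (len(p) * len(n))
        aucs.append(auc)
        gaps.append(p.mean() - n.mean())
    supported = int(sum(a > auc_floor for a in aucs))
    return supported, float(np.mean(aucs)), float(np.mean(gaps))

rng = np.random.default_rng(2)
pos = rng.normal(1.0, 1.0, size=250)  # scores under the frozen family, perturbed
neg = rng.normal(0.0, 1.0, size=250)  # control condition
supported, mean_auc, mean_gap = fold_support(pos, neg)
```

Reporting per-fold AUCs with a dispersion measure, rather than only the mean and the 5/5 count, would directly answer the error-bar request.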
Circularity Check
No equations, derivations, or self-citations in abstract; claims rest on empirical re-identification without visible reduction to inputs.
Full rationale
The provided abstract contains no equations, parameter-fitting steps, or citations. The central procedure—identifying a source-local chart and broader family on Gemma data then freezing and re-identifying it on Phi and mouse data—is described at a high level without any mathematical definition that would allow the re-identification to reduce tautologically to the original selection criteria. No load-bearing step is shown to be self-definitional, fitted-then-renamed-as-prediction, or dependent on a self-citation chain. The text therefore supplies no inspectable derivation chain that collapses by construction.