Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning
Pith reviewed 2026-05-25 04:58 UTC · model grok-4.3
The pith
Representational convergence across language models stems from shared input processing constraints rather than shared reasoning strategies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On 800 reasoning problems, CKA similarity is higher for problems the models collectively fail (0.897) than for those they solve (0.830), and higher for pre-decision representations (0.875) than for post-decision representations (0.274). Shared information transfers across models with 66 percent accuracy, yet ablating it changes predictions in only 1.5 to 5.5 percent of cases. The paper takes these patterns to indicate that representational convergence reflects common input-processing constraints rather than common reasoning strategies.
What carries the argument
Three dissociations (difficulty inversion, generation gap, and epiphenomenal correctness), measured by CKA similarity, transfer accuracy, and ablation flip rates across stratified reasoning problems.
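As a point of reference for the CKA figures quoted on this page, here is a minimal sketch of linear CKA in Python with NumPy. The page does not say which CKA variant the paper uses (linear or kernel, biased or debiased), so the linear, biased form below is an assumption, and the function name `linear_cka` is illustrative.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two activation matrices with matching rows.

    X has shape (n_examples, d1) and Y has shape (n_examples, d2);
    row i of X and row i of Y must come from the same input problem.
    """
    # Center each feature so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))
```

The score is invariant to orthogonal transformations and isotropic scaling of either representation, which is why it can compare models with different hidden sizes.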
If this is right
- Ensemble design should prioritize diversity in reasoning steps even when representations align.
- Interpretability methods developed on one model are unlikely to transfer directly to the reasoning of another model.
- Similarity evaluations based solely on representations may overestimate agreement on how problems are solved.
- Model comparisons that rely on early-layer or pre-decision activations capture input constraints more than final reasoning.
Where Pith is reading between the lines
- Future similarity benchmarks could weight post-decision representations more heavily to better track reasoning alignment.
- Training regimes that increase causal influence of shared representations might reduce the observed generation gap.
- The difficulty inversion suggests that scaling laws for accuracy and for representation similarity may operate on different mechanisms.
Load-bearing premise
The metrics of CKA similarity, transfer accuracy, and ablation flip rates successfully separate input processing constraints from reasoning strategies on the selected problems and model families.
What would settle it
Finding high flip rates when shared representations are ablated, or post-decision CKA values comparable to the pre-decision ones, would indicate that the observed dissociations do not hold.
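Because the test above turns on flip rates, a sketch of how such a rate could be computed may help. It assumes that ablation means projecting a set of linearly independent shared directions out of the hidden states, and that predictions come from a linear readout. Both are simplifications, and `ablate_subspace`, `flip_rate`, `W`, and `directions` are hypothetical names rather than the paper's protocol.

```python
import numpy as np

def ablate_subspace(H: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Remove the span of `directions` (k, d) from each row of H (n, d)."""
    # Orthonormal basis for the subspace to remove, shape (d, k).
    Q, _ = np.linalg.qr(directions.T)
    return H - (H @ Q) @ Q.T

def flip_rate(H: np.ndarray, W: np.ndarray, directions: np.ndarray) -> float:
    """Fraction of examples whose argmax prediction changes after ablation.

    W is a hypothetical linear readout of shape (d, n_classes).
    """
    before = np.argmax(H @ W, axis=1)
    after = np.argmax(ablate_subspace(H, directions) @ W, axis=1)
    return float(np.mean(before != after))
```

Under this toy setup, a direction the readout ignores yields a flip rate of zero however decodable it is, which is the pattern the page calls epiphenomenal correctness.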
Original abstract
Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence-without-understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that representational convergence among LLMs (measured via CKA) does not extend to shared reasoning processes. Across 16 models (8 families, 1.5B–72B) and 800 stratified reasoning problems, it reports three dissociations: (1) difficulty inversion (CKA 0.897 on collective failures vs. 0.830 on successes), (2) generation gap (CKA 0.875 pre-decision vs. 0.274 post-decision), and (3) epiphenomenal correctness (66% cross-model transfer accuracy but only 1.5–5.5% prediction flip rates under ablation). It concludes that convergence reflects shared input-processing constraints rather than shared reasoning strategies, with implications for ensembles, interpretability, and similarity metrics. Code is released.
Significance. If the dissociations are robust, the work provides a useful empirical counterpoint to the Platonic Representation Hypothesis by separating input constraints from causal reasoning. The scale (16 models, 800 problems, multiple families) and public code are clear strengths that enable follow-up. The results could inform ensemble design and caution against assuming representational similarity implies interchangeable reasoning.
major comments (2)
- [Abstract] Abstract and methods: The central claim rests on the three quantitative dissociations, yet the abstract and reported protocol provide no details on statistical testing for CKA differences, exact problem selection criteria, stratification controls, or confound checks (e.g., training-data overlap). This directly affects whether the reported gaps support the input-vs-reasoning dissociation.
- [Epiphenomenal correctness] Epiphenomenal correctness section: The argument that shared representations exert 'minimal causal influence' relies on 66% transfer accuracy paired with 1.5–5.5% flip rates under ablation. If the ablated activations are correlated with input statistics but not the actual decision variables used by each model, low flip rates can occur without implying absence of shared reasoning steps; the manuscript does not demonstrate that the interventions isolate causal reasoning features.
minor comments (2)
- [Generation gap analysis] Clarify the precise layer or token positions used to define 'pre-decision' versus 'post-decision' representations and how these align across model families of different depths.
- [Problem selection] The problem set stratification by 'causal relevance' should be described with explicit criteria or examples so readers can assess whether shared failure modes are controlled.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, with planned revisions to strengthen the manuscript where the concerns are valid.
- Referee: [Abstract] Abstract and methods: The central claim rests on the three quantitative dissociations, yet the abstract and reported protocol provide no details on statistical testing for CKA differences, exact problem selection criteria, stratification controls, or confound checks (e.g., training-data overlap). This directly affects whether the reported gaps support the input-vs-reasoning dissociation.
  Authors: We agree that the abstract and methods lack sufficient detail on these elements. In the revision we will (1) add bootstrap-based statistical tests with p-values for all reported CKA differences, (2) expand the problem selection and stratification description to specify the exact difficulty metric, domain balancing procedure, and sampling controls, and (3) include explicit confound checks for input length, tokenization differences, and available training-data overlap information. These additions will be reflected in both the abstract and a dedicated methods subsection.
  Revision: yes
- Referee: [Epiphenomenal correctness] Epiphenomenal correctness section: The argument that shared representations exert 'minimal causal influence' relies on 66% transfer accuracy paired with 1.5–5.5% flip rates under ablation. If the ablated activations are correlated with input statistics but not the actual decision variables used by each model, low flip rates can occur without implying absence of shared reasoning steps; the manuscript does not demonstrate that the interventions isolate causal reasoning features.
  Authors: This concern is well-taken; the current ablation results alone do not fully rule out that the targeted activations primarily reflect input statistics. We will add two new analyses in the revision: (a) direct comparison of flip rates when ablating early-layer (input-dominated) versus mid-to-late-layer activations, and (b) correlation of the ablated activations with both input features and final logits. These controls will either strengthen or appropriately qualify the causal claim. If the additional analyses remain inconclusive, we will revise the language in the discussion to reflect the remaining ambiguity.
  Revision: partial
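The first response promises bootstrap-based tests for CKA differences. The sketch below shows one way such a test could look: a percentile bootstrap over problems for the gap between CKA on collectively failed problems and CKA on solved problems. The resampling unit, the percentile interval, and every name here are assumptions for illustration, not the authors' planned procedure.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear (biased) CKA between row-aligned activation matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return float(cross / (np.linalg.norm(X.T @ X, ord="fro")
                          * np.linalg.norm(Y.T @ Y, ord="fro")))

def bootstrap_cka_gap(Xa_fail, Xb_fail, Xa_solve, Xb_solve,
                      n_boot: int = 1000, seed: int = 0):
    """Mean and 95% percentile interval for CKA(fail) - CKA(solve).

    Xa_* and Xb_* hold activations of models A and B on the same
    problems; problems are resampled with replacement within each group.
    """
    rng = np.random.default_rng(seed)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        f = rng.integers(0, len(Xa_fail), size=len(Xa_fail))
        s = rng.integers(0, len(Xa_solve), size=len(Xa_solve))
        gaps[i] = (linear_cka(Xa_fail[f], Xb_fail[f])
                   - linear_cka(Xa_solve[s], Xb_solve[s]))
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return float(gaps.mean()), float(lo), float(hi)
```

An interval that excludes zero would support the reported difficulty inversion for that model pair. Biased CKA tends to inflate at small sample sizes, so the two groups should be of similar size or a debiased estimator should be substituted.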
Circularity Check
No circularity: empirical measurements with no self-referential derivations
The paper reports observational results from applying CKA similarity, transfer accuracy, and ablation protocols to existing pre-trained models on a fixed problem set. No equations, predictions, or first-principles claims appear; the dissociations (difficulty inversion, generation gap, epiphenomenal correctness) are direct outputs of the chosen metrics rather than quantities defined in terms of fitted parameters or prior self-citations within the work. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Centered Kernel Alignment (CKA) is an appropriate metric for comparing internal representations across models
- domain assumption: Ablation protocols and transfer accuracy measure causal influence on model predictions
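The second assumption leans on transfer accuracy, which this page does not define. One plausible reading, sketched below, is: fit a linear probe on model A's activations, learn a linear map from model B's activation space into A's on paired examples, then score the probe on B's mapped activations. The ridge probe, the least-squares alignment, the assumption of roughly centered activations, and all names are illustrative, not the paper's method.

```python
import numpy as np

def fit_ridge(X: np.ndarray, Y: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Ridge regression weights mapping X to Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def transfer_accuracy(A_train, y_train, A_align, B_align,
                      B_test, y_test, lam: float = 1e-2) -> float:
    """Accuracy of a probe trained on model A when applied to model B.

    y_* are binary labels in {0, 1}; A_align and B_align are activations
    of the two models on the same alignment examples.
    """
    # Probe on model A: regress onto +/-1 targets, classify by sign.
    w = fit_ridge(A_train, 2.0 * y_train - 1.0, lam)
    # Alignment map from B's space into A's space.
    M = fit_ridge(B_align, A_align, lam)
    preds = ((B_test @ M) @ w > 0).astype(int)
    return float(np.mean(preds == y_test))
```

A high score here shows only that the label is linearly decodable through the shared map; it does not show that either model uses that information, which is the gap the ablation analysis is meant to probe.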
Appendix A. Implementation Details
All experiments were conducted on a single NVIDIA RTX 5090 GPU (32GB VRAM) with an Intel Core Ultra 7 265K CPU (20 cores, 62GB RAM). Models were loaded in bfloat16 precision and evaluated with greedy decoding (temperature 0, no sampling). Hidden states were extracted at the last input token position prior to generation. ...
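The appendix says hidden states were taken at the last input token position. A minimal sketch of that selection step follows, assuming a batch of per-token hidden states and a 0/1 attention mask with at least one real token per row. The array layout and the name `last_token_states` are assumptions; the paper's own extraction code is not shown on this page.

```python
import numpy as np

def last_token_states(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state at the last non-padded token of each sequence.

    hidden: (batch, seq_len, d) activations from one layer.
    attention_mask: (batch, seq_len) with 1 for real tokens, 0 for padding.
    Works for both left- and right-padded batches.
    """
    seq_len = attention_mask.shape[1]
    # argmax over the reversed mask finds the first 1 from the right.
    last_idx = seq_len - 1 - np.argmax(attention_mask[:, ::-1], axis=1)
    return hidden[np.arange(hidden.shape[0]), last_idx]
```

Indexing by the mask rather than by position -1 matters when prompts in a batch have different lengths, since position -1 would otherwise land on padding for right-padded inputs.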