Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

Dong Eui Chang; Muhammad Usama

arxiv: 2605.23315 · v1 · pith:ZIVMQLGNnew · submitted 2026-05-22 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-25 04:58 UTC · model grok-4.3

keywords: language models, representational convergence, reasoning processes, CKA similarity, model interpretability, ensemble methods, Platonic Representation Hypothesis, ablation analysis

The pith

Representational convergence across language models stems from shared input processing constraints rather than shared reasoning strategies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the documented similarity in internal representations among large language models, known as the Platonic Representation Hypothesis, also applies to the reasoning processes that use those representations. Testing 16 models from eight families on 800 problems in math, science, commonsense, and truthfulness, the authors measure similarity with Centered Kernel Alignment while stratifying by difficulty, stage, and causal relevance. They identify three clear dissociations: greater alignment on problems models fail than on those they solve, alignment before but not after a decision is made, and shared information that can be decoded but has almost no causal effect on outputs. If correct, these results mean that current similarity measures capture input constraints more than reasoning agreement, which changes how ensembles, interpretability, and model comparisons should be approached.

Core claim

On 800 reasoning problems, CKA similarity is higher for problems the models collectively fail (0.897) than for those they solve (0.830), higher in pre-decision layers (0.875) than post-decision layers (0.274), and while representations allow 66 percent transfer accuracy, ablating the shared information changes predictions in only 1.5 to 5.5 percent of cases. These patterns show that representational convergence reflects common input processing constraints instead of common reasoning strategies.
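The CKA values quoted above can be made concrete with a small sketch. The block below computes linear CKA with NumPy on synthetic activations. Whether the paper uses this linear variant, a kernel variant, or a debiased estimator is not stated in this summary, so the function `linear_cka` and all array names are illustrative assumptions rather than the authors' code.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices with aligned rows.

    x has shape (n_examples, d1) and y has shape (n_examples, d2); the hidden
    sizes d1 and d2 may differ, which is what allows cross-model comparison.
    """
    # Center every feature so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, ord="fro")
                          * np.linalg.norm(y.T @ y, ord="fro")))


# Synthetic stand-ins for two models' activations on the same 50 problems.
rng = np.random.default_rng(0)
reps_a = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
reps_b_same = 3.0 * reps_a @ rotation      # rotated and rescaled copy of reps_a
reps_b_unrelated = rng.normal(size=(50, 12))

cka_same = linear_cka(reps_a, reps_b_same)
cka_unrelated = linear_cka(reps_a, reps_b_unrelated)
```

The rotated and rescaled copy scores 1 up to floating-point error, because linear CKA is invariant to orthogonal transforms and isotropic scaling. Unrelated random features still score above zero at this sample size, which is one reason baselines matter when reading absolute CKA values.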

What carries the argument

Three dissociations—difficulty inversion, generation gap, and epiphenomenal correctness—measured by CKA similarity, transfer accuracy, and ablation flip rates across stratified reasoning problems.

If this is right

Ensemble design should prioritize diversity in reasoning steps even when representations align.
Interpretability methods developed on one model are unlikely to transfer directly to the reasoning of another model.
Similarity evaluations based solely on representations may overestimate agreement on how problems are solved.
Model comparisons that rely on early-layer or pre-decision activations capture input constraints more than final reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future similarity benchmarks could weight post-decision representations more heavily to better track reasoning alignment.
Training regimes that increase causal influence of shared representations might reduce the observed generation gap.
The difficulty inversion suggests that scaling laws for accuracy and for representation similarity may operate on different mechanisms.

Load-bearing premise

The metrics of CKA similarity, transfer accuracy, and ablation flip rates successfully separate input processing constraints from reasoning strategies on the selected problems and model families.

What would settle it

Finding high flip rates when shared representations are ablated or comparable CKA values in post-decision layers would indicate that the observed dissociations do not hold.
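To make the flip-rate criterion concrete, here is a minimal sketch, assuming a linear output head and a single ablated direction. The paper's actual ablation protocols are not described in this summary; `ablate_direction`, `flip_rate`, and the toy `head` matrix are hypothetical names chosen for illustration.

```python
import numpy as np


def ablate_direction(h: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Project the given direction out of every row of h."""
    u = direction / np.linalg.norm(direction)
    return h - np.outer(h @ u, u)


def flip_rate(before: np.ndarray, after: np.ndarray) -> float:
    """Fraction of examples whose predicted label changes after ablation."""
    return float(np.mean(before != after))


# Hypothetical toy setup: a linear output head stands in for a model's readout.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(200, 16))
head = rng.normal(size=(16, 4))
baseline = (hidden @ head).argmax(axis=1)

# QR of [head | random column]: the last orthonormal column is orthogonal to
# every column of head, so removing it cannot move the logits.
q, _ = np.linalg.qr(np.hstack([head, rng.normal(size=(16, 1))]))
off_target = q[:, -1]
on_target = head[:, 0]                     # a direction the head reads from

pred_off = (ablate_direction(hidden, off_target) @ head).argmax(axis=1)
pred_on = (ablate_direction(hidden, on_target) @ head).argmax(axis=1)
flip_off = flip_rate(baseline, pred_off)
flip_on = flip_rate(baseline, pred_on)
```

In this toy setting, removing a direction the head never reads leaves predictions unchanged, while removing one it does read flips a share of them. Real models have nonlinear layers after the ablation site, so the sketch illustrates the metric only, not the paper's result.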

Figures

Figures reproduced from arXiv: 2605.23315 by Dong Eui Chang, Muhammad Usama.

**Figure 3.** Per-domain difficulty inversion. CKA versus difficulty for each reasoning domain.

**Figure 4.** Layer-wise difficulty inversion. The inversion between CKA and difficulty emerges ...

**Figure 5.** Epiphenomenal correctness. Left: transfer probe accuracy (66%, above 55% base ...

**Figure 6.** Attention entropy and difficulty. All 6 models show a negative correlation ( ...

**Figure 7.** Baseline analyses. (a) Randomly initialized models show higher representational ...

Original abstract

Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence-without-understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reports three dissociations showing high CKA similarity across LLMs does not mean shared reasoning steps, but the ablation results need checks for whether they truly isolate causal influence.

The key point here is that the authors measured representational similarity on reasoning tasks and found it does not line up with actual model behavior in three ways: higher CKA on problems the models fail than on ones they solve, strong alignment in pre-decision layers but a sharp drop after the decision point, and high cross-model transfer accuracy even though ablating the shared information barely changes predictions. This is presented as evidence that convergence comes from shared input constraints rather than shared reasoning strategies.

They ran this across 16 models in 8 families on 800 problems in math, science, commonsense, and truthfulness, with stratification by difficulty and stage, and they released the code. That scale and the concrete numbers are the main new pieces; extending the Platonic Representation Hypothesis to reasoning processes with these specific splits had not been done before. The generation gap numbers in particular stand out as a clean separation of stages. The work is empirical measurement rather than new theory, and the citation pattern is appropriate for the subfield.

The soft spots sit in the methods that the abstract does not detail. Problem selection criteria, exact controls for training data overlap, and whether the ablation protocols hit decision variables or just correlated input features all matter for the epiphenomenal correctness claim. The stress-test concern, that low flip rates could occur even with shared reasoning if the interventions are off-target, is worth verifying in the full text; if it holds, the causal interpretation weakens. No statistical tests are mentioned for the CKA or transfer differences, which is a gap but probably fixable.

This is for people working on interpretability, model similarity metrics, and ensemble methods who need to know what CKA actually predicts about downstream behavior. It deserves a serious referee because the empirical scope is large enough and the question is direct, even though the causal language will need tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper claims that representational convergence among LLMs (measured via CKA) does not extend to shared reasoning processes. Across 16 models (8 families, 1.5B–72B) and 800 stratified reasoning problems, it reports three dissociations: (1) difficulty inversion (CKA 0.897 on collective failures vs. 0.830 on successes), (2) generation gap (CKA 0.875 pre-decision vs. 0.274 post-decision), and (3) epiphenomenal correctness (66% cross-model transfer accuracy but only 1.5–5.5% prediction flip rates under ablation). It concludes that convergence reflects shared input-processing constraints rather than shared reasoning strategies, with implications for ensembles, interpretability, and similarity metrics. Code is released.

Significance. If the dissociations are robust, the work provides a useful empirical counterpoint to the Platonic Representation Hypothesis by separating input constraints from causal reasoning. The scale (16 models, 800 problems, multiple families) and public code are clear strengths that enable follow-up. The results could inform ensemble design and caution against assuming representational similarity implies interchangeable reasoning.

major comments (2)

[Abstract] Abstract and methods: The central claim rests on the three quantitative dissociations, yet the abstract and reported protocol provide no details on statistical testing for CKA differences, exact problem selection criteria, stratification controls, or confound checks (e.g., training-data overlap). This directly affects whether the reported gaps support the input-vs-reasoning dissociation.
[Epiphenomenal correctness] Epiphenomenal correctness section: The argument that shared representations exert 'minimal causal influence' relies on 66% transfer accuracy paired with 1.5–5.5% flip rates under ablation. If the ablated activations are correlated with input statistics but not the actual decision variables used by each model, low flip rates can occur without implying absence of shared reasoning steps; the manuscript does not demonstrate that the interventions isolate causal reasoning features.
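The transfer-accuracy half of this argument can be sketched as follows, assuming that transfer means training a linear probe on one model and evaluating it on a second model after a learned linear map between their activation spaces. That protocol is an assumption, as are the function names and the synthetic data; the paper's exact probe may differ.

```python
import numpy as np


def add_bias(reps):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([reps, np.ones((len(reps), 1))])


def fit_probe(reps, labels):
    """Least-squares linear probe for binary labels in {0, 1}."""
    weights, *_ = np.linalg.lstsq(add_bias(reps), 2.0 * labels - 1.0, rcond=None)
    return weights


def fit_alignment(source, target):
    """Least-squares linear map from source space into target space."""
    mapping, *_ = np.linalg.lstsq(source, target, rcond=None)
    return mapping


def transfer_accuracy(reps_a, reps_b, labels, n_train):
    """Train a probe on model A, then score it on model B mapped into A's space."""
    probe = fit_probe(reps_a[:n_train], labels[:n_train])
    mapping = fit_alignment(reps_b[:n_train], reps_a[:n_train])
    scores = add_bias(reps_b[n_train:] @ mapping) @ probe
    return float(np.mean((scores > 0).astype(int) == labels[n_train:]))


# Synthetic data: both "models" linearly encode the same latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 6))
labels = (latent[:, 0] > 0).astype(int)    # stand-in for answer correctness
reps_a = latent @ rng.normal(size=(6, 12)) + 0.1 * rng.normal(size=(400, 12))
reps_b = latent @ rng.normal(size=(6, 10)) + 0.1 * rng.normal(size=(400, 10))

accuracy = transfer_accuracy(reps_a, reps_b, labels, n_train=300)
```

High transfer accuracy here only shows that label information is linearly recoverable across the two spaces. It says nothing about whether either model's output depends on that information, which is the gap this comment targets.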

minor comments (2)

[Generation gap analysis] Clarify the precise layer or token positions used to define 'pre-decision' versus 'post-decision' representations and how these align across model families of different depths.
[Problem selection] The problem set stratification by 'causal relevance' should be described with explicit criteria or examples so readers can assess whether shared failure modes are controlled.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, with planned revisions to strengthen the manuscript where the concerns are valid.

Referee: [Abstract] Abstract and methods: The central claim rests on the three quantitative dissociations, yet the abstract and reported protocol provide no details on statistical testing for CKA differences, exact problem selection criteria, stratification controls, or confound checks (e.g., training-data overlap). This directly affects whether the reported gaps support the input-vs-reasoning dissociation.

Authors: We agree that the abstract and methods lack sufficient detail on these elements. In the revision we will (1) add bootstrap-based statistical tests with p-values for all reported CKA differences, (2) expand the problem selection and stratification description to specify the exact difficulty metric, domain balancing procedure, and sampling controls, and (3) include explicit confound checks for input length, tokenization differences, and available training-data overlap information. These additions will be reflected in both the abstract and a dedicated methods subsection.

revision: yes

Referee: [Epiphenomenal correctness] Epiphenomenal correctness section: The argument that shared representations exert 'minimal causal influence' relies on 66% transfer accuracy paired with 1.5–5.5% flip rates under ablation. If the ablated activations are correlated with input statistics but not the actual decision variables used by each model, low flip rates can occur without implying absence of shared reasoning steps; the manuscript does not demonstrate that the interventions isolate causal reasoning features.

Authors: This concern is well-taken; the current ablation results alone do not fully rule out that the targeted activations primarily reflect input statistics. We will add two new analyses in the revision: (a) direct comparison of flip rates when ablating early-layer (input-dominated) versus mid-to-late-layer activations, and (b) correlation of the ablated activations with both input features and final logits. These controls will either strengthen or appropriately qualify the causal claim. If the additional analyses remain inconclusive, we will revise the language in the discussion to reflect the remaining ambiguity.

revision: partial
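The bootstrap test promised in the first response could take roughly the following form. This is a sketch under assumptions: it resamples problems with replacement within each group, uses linear CKA, and reports a percentile interval for the gap. The authors' planned procedure is not specified, and `bootstrap_cka_gap` is a hypothetical name.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two row-aligned representation matrices."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    return float(np.linalg.norm(y.T @ x) ** 2
                 / (np.linalg.norm(x.T @ x) * np.linalg.norm(y.T @ y)))


def bootstrap_cka_gap(xa, ya, xb, yb, n_boot=1000, seed=0):
    """Bootstrap the gap CKA(group A) - CKA(group B) by resampling problems.

    xa, ya: two models' representations on group A problems (rows aligned).
    xb, yb: the same two models on group B problems.
    Returns the observed gap and a 95% percentile interval.
    """
    rng = np.random.default_rng(seed)
    observed = linear_cka(xa, ya) - linear_cka(xb, yb)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        ia = rng.integers(0, len(xa), size=len(xa))   # resample group A problems
        ib = rng.integers(0, len(xb), size=len(xb))   # resample group B problems
        gaps[i] = linear_cka(xa[ia], ya[ia]) - linear_cka(xb[ib], yb[ib])
    low, high = np.percentile(gaps, [2.5, 97.5])
    return observed, (float(low), float(high))


# Synthetic groups: model 2 tracks model 1 closely on group A, weakly on group B.
rng = np.random.default_rng(0)
m1_a = rng.normal(size=(100, 8))
m1_b = rng.normal(size=(100, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
m2_a = m1_a @ rotation + 0.1 * rng.normal(size=(100, 8))
m2_b = m1_b @ rotation + 3.0 * rng.normal(size=(100, 8))

gap, (low, high) = bootstrap_cka_gap(m1_a, m2_a, m1_b, m2_b, n_boot=200)
```

An interval that excludes zero would support a real difference between the two groups. Resampling with replacement duplicates rows, which may shift CKA estimates, so a permutation test over group labels is a reasonable alternative.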

Circularity Check

0 steps flagged

No circularity: empirical measurements with no self-referential derivations

The paper reports observational results from applying CKA similarity, transfer accuracy, and ablation protocols to existing pre-trained models on a fixed problem set. No equations, predictions, or first-principles claims appear; the dissociations (difficulty inversion, generation gap, epiphenomenal correctness) are direct outputs of the chosen metrics rather than quantities defined in terms of fitted parameters or prior self-citations within the work. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard assumptions from representational similarity literature and causal intervention methods in machine learning; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: Centered Kernel Alignment (CKA) is an appropriate metric for comparing internal representations across models. Used to quantify all reported similarities without further justification in the abstract.
domain assumption: Ablation protocols and transfer accuracy measure causal influence on model predictions. Central to the epiphenomenal correctness claim.


auto" to split layers across GPU and CPU. The 70B models (LLaMA-3.1-70B and Qwen-2.5-72B) were evaluated on a separate server with 2× NVIDIA A100 80GB GPUs usingdevice_map=

Appendix A. Implementation Details All experiments were conducted on a single NVIDIA RTX 5090 GPU (32GB VRAM) with an Intel Core Ultra 7 265K CPU (20 cores, 62GB RAM). Models were loaded in bfloat16 precision and evaluated with greedy decoding (temperature 0, no sampling). Hidden states were extracted at the last input token position prior to generation. ...

work page 2022
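Extracting hidden states at the last input token, as the appendix excerpt describes, reduces to an indexing step once a layer's activations are available. The sketch below assumes right-padded batches and an attention mask in the usual 1/0 convention; the function name and array layout are illustrative, not taken from the released code.

```python
import numpy as np


def last_input_token_states(hidden: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pick the hidden state at each sequence's last non-padding token.

    hidden: (batch, seq_len, dim) activations from one layer.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    Assumes right padding and at least one real token per row.
    """
    last = attention_mask.sum(axis=1) - 1          # index of the last real token
    return hidden[np.arange(hidden.shape[0]), last]


# Toy batch: two prompts of lengths 3 and 2, padded to length 4.
hidden = np.arange(24).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],
                 [1, 1, 0, 0]])
states = last_input_token_states(hidden, mask)
```

With left padding, the last real token is simply the final position in every row, so the mask arithmetic would not be needed.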