pith. machine review for the scientific record.

arxiv: 2604.19974 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.CL

Recognition: unknown

Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:46 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords sparse autoencoders · LLM interpretability · uncertainty estimation · correctness prediction · feature analysis · model intervention · selective abstention

The pith

LLM uncertainty and correctness are encoded by distinct internal features, not the same ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether uncertainty and correctness in LLM outputs stem from the same internal features. It splits predictions into a 2x2 grid by whether they are right or wrong and whether the model is confident or uncertain, then uses sparse autoencoders to extract the features associated with each quadrant. This reveals three populations: pure uncertainty features that are vital (suppressing them drops accuracy), pure incorrectness features that are largely inert (suppressing them changes little), and mixed features whose suppression improves accuracy by about one percent and cuts uncertainty by 75 percent. Just three of the mixed features suffice to predict correctness, supporting abstention decisions that raise accuracy from 62 to 81 percent. Together these findings indicate that the two properties operate through separate mechanisms inside the network.
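The 2x2 split described above can be sketched as follows. The entropy threshold, the median default, and the data layout are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def quadrant_split(correct, entropy, threshold=None):
    """Partition predictions into a 2x2 grid by correctness and confidence.

    correct:   boolean array, True where the model answered correctly
    entropy:   output entropy per prediction (higher = more uncertain)
    threshold: entropy cutoff separating confident from uncertain;
               defaults to the median (an assumption, not the paper's choice)
    """
    correct = np.asarray(correct, dtype=bool)
    entropy = np.asarray(entropy, dtype=float)
    if threshold is None:
        threshold = np.median(entropy)
    uncertain = entropy > threshold
    return {
        "confident_correct":   np.flatnonzero(correct & ~uncertain),
        "confident_incorrect": np.flatnonzero(~correct & ~uncertain),
        "uncertain_correct":   np.flatnonzero(correct & uncertain),
        "uncertain_incorrect": np.flatnonzero(~correct & uncertain),
    }

# Four predictions, one per quadrant.
quads = quadrant_split([True, True, False, False],
                       [0.1, 0.9, 0.2, 0.8], threshold=0.5)
```

Each quadrant's indices then select the residual-stream activations whose SAE features are contrasted in the discovery step.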

Core claim

Sparse autoencoders applied to a 2x2 partitioning of model predictions by correctness and confidence identify three functionally distinct feature populations in Llama-3.1-8B and Gemma-2-9B. Pure uncertainty features are functionally essential: their suppression severely degrades accuracy. Pure incorrectness features are functionally inert, producing near-zero accuracy change when suppressed despite activation differences. Confounded features encoding both signals are detrimental: their targeted suppression improves accuracy by 1.1% and reduces entropy by 75%, with transfer to ARC-Challenge and RACE. Three such features from a mid-layer predict correctness at AUROC ~0.79 and enable selective abstention that raises accuracy from 62% to 81% at 53% coverage.

What carries the argument

A 2x2 framework that partitions predictions along correctness and confidence axes, paired with sparse autoencoders to isolate features for each dimension.
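The SAE half of that machinery is a standard ReLU sparse autoencoder over residual-stream vectors. A minimal sketch, with random stand-in weights where a pretrained dictionary (e.g. Llama Scope or Gemma Scope) would be used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 512  # residual-stream width and dictionary size (illustrative)

# Random stand-in weights; in practice these come from a pretrained SAE.
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """Nonnegative sparse feature activations for a residual-stream vector."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    """Reconstruct the residual-stream vector from feature activations."""
    return f @ W_dec + b_dec

x = rng.normal(size=d_model)
f = sae_encode(x)        # feature vector, zero wherever ReLU clips
x_hat = sae_decode(f)    # approximate reconstruction of x
```

The per-quadrant analysis then operates on `f`, not on the raw residual stream.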

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models might be made more reliable by monitoring and adjusting only the confounded features during generation.
  • Similar dissociation could be tested for other output properties such as factual accuracy versus stylistic confidence.
  • The inert nature of some error-related features suggests that error detection may rely on different circuits than error correction.
  • Extending this to larger models could reveal if the separation scales or changes with model size.

Load-bearing premise

That the features found by the sparse autoencoders have a causal influence on the model's behavior such that changing their activation produces predictable and isolated effects on accuracy and uncertainty.
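The suppression intervention this premise underwrites can be sketched as zeroing selected SAE features and writing the edit back into the residual stream. The error-preserving convention below is common in SAE intervention work but is an assumption about this paper's exact procedure.

```python
import numpy as np

def suppress_features(x, W_enc, b_enc, W_dec, b_dec, feature_ids):
    """Zero selected SAE features and return the edited residual-stream vector.

    The reconstruction error (x - x_hat) is carried through unchanged, so the
    edit only removes the targeted features' decoder contributions.
    """
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # encode
    x_hat = f @ W_dec + b_dec                # full reconstruction
    f_edit = f.copy()
    f_edit[list(feature_ids)] = 0.0          # suppress targeted features
    x_edit = f_edit @ W_dec + b_dec
    return x + (x_edit - x_hat)

# With no features targeted, the residual stream is unchanged.
rng = np.random.default_rng(1)
W_enc, b_enc = rng.normal(size=(8, 16)), np.zeros(16)
W_dec, b_dec = rng.normal(size=(16, 8)), np.zeros(8)
x = rng.normal(size=8)
x_same = suppress_features(x, W_enc, b_enc, W_dec, b_dec, [])
```

Under the load-bearing premise, passing a pure-uncertainty feature set here should degrade accuracy, while a pure-incorrectness set should change almost nothing.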

What would settle it

Observing no change in accuracy when the pure uncertainty features are suppressed, or no improvement when the confounded features are suppressed, on the same or similar models and tasks.

Figures

Figures reproduced from arXiv: 2604.19974 by Evangelos E. Papalexakis, Het Patel, Hua Wei, Jia Chen, Tiejin Chen.

Figure 1.1. Uncertainty as a signal for correctness, and where it fails. Top: uncertainty aligns with reliability: the model is confident when correct, uncertain when incorrect. Bottom: a failure case where the model is confidently wrong. Both are common in deployment, raising our open question: when a model is uncertain or wrong, are these signals driven by the same internal mechanism, or by distinct feature popul… view at source ↗
Figure 4.1. Experimental framework. MCQ inference (1) produces residual stream activations, which are encoded by sparse autoencoders (2a). Predictions and output entropy define a 2 × 2 quadrant split (2b). Feature discovery (3) applies Mann-Whitney U tests to the quadrant groups, yielding three categories. Suppression (4) zeroes selected features; validation (5) evaluates on a held-out split. At each layer, we int… view at source ↗
Figure 5.1. Peak effect size (Cohen's d) by normalized depth for three feature categories across Llama-3.1-8B (32 layers) and Gemma-2-9B (42 layers). All three categories show increasing effect sizes with model depth, consistent with greater representational disentangling in later layers. They differ in magnitude: pure uncertainty features (left) reach d > 6, while pure incorrectness (center) and confounded features… view at source ↗
Figure 5.2. Per-layer correctness prediction AUROC by feature category (Llama-3.1-8B). Per-layer logistic regression classifiers are trained on MMLU discovery set SAE feature activations and evaluated on the held-out validation set. The dashed line shows the AUROC obtained by using output entropy as a predictor of correctness (0.805), which requires a full forward pass. … view at source ↗
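The Mann-Whitney-based feature discovery shown in Figure 4.1 amounts to a per-feature two-sample test between quadrant groups. A sketch using scipy, with a Bonferroni correction as one simple multiple-testing choice (the paper's exact thresholding is not specified here):

```python
import numpy as np
from scipy.stats import mannwhitneyu

def discover_features(acts_a, acts_b, alpha=0.01):
    """Find SAE features that activate differently across two quadrant groups.

    acts_a, acts_b: (n_examples, n_features) SAE activation matrices for the
    two contrasted groups (e.g. uncertain vs. confident predictions).
    Returns indices significant under a Bonferroni-corrected U test.
    """
    n_features = acts_a.shape[1]
    pvals = np.array([
        mannwhitneyu(acts_a[:, j], acts_b[:, j],
                     alternative="two-sided").pvalue
        for j in range(n_features)
    ])
    return np.flatnonzero(pvals < alpha / n_features)

rng = np.random.default_rng(0)
a = rng.normal(0, 1, (200, 5))
a[:, 2] += 3.0                       # feature 2 shifted in group A
b = rng.normal(0, 1, (200, 5))
hits = discover_features(a, b)       # should pick out feature 2
```

Features significant in the uncertainty contrast only, the correctness contrast only, or both would map onto the pure and confounded categories respectively.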
read the original abstract

Large language models can be uncertain yet correct, or confident yet wrong, raising the question of whether their output-level uncertainty and their actual correctness are driven by the same internal mechanisms or by distinct feature populations. We introduce a 2x2 framework that partitions model predictions along correctness and confidence axes, and uses sparse autoencoders to identify features associated with each dimension independently. Applying this to Llama-3.1-8B and Gemma-2-9B, we identify three feature populations that play fundamentally different functional roles. Pure uncertainty features are functionally essential: suppressing them severely degrades accuracy. Pure incorrectness features are functionally inert: despite showing statistically significant activation differences between correct and incorrect predictions, the majority produce near-zero change in accuracy when suppressed. Confounded features that encode both signals are detrimental to output quality, and targeted suppression of them yields a 1.1% accuracy improvement and a 75% entropy reduction, with effects transferring across the ARC-Challenge and RACE benchmarks. The feature categories are also informationally distinct: the activations of just 3 confounded features from a single mid-network layer predict model correctness (AUROC ~0.79), enabling selective abstention that raises accuracy from 62% to 81% at 53% coverage. The results demonstrate that uncertainty and correctness are distinct internal phenomena, with implications for interpretability and targeted inference-time intervention.
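The abstract's selective-abstention result can be illustrated on synthetic data: fit a probe on a few "confounded feature" activations, then answer only the predictions the probe trusts most. The feature values, classifier, and coverage rule below are stand-ins, not the paper's data or setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 3 feature activations per prediction, shifted when the
# model is correct; base accuracy chosen near the paper's 62% (illustrative).
n = 2000
correct = rng.random(n) < 0.62
feats = rng.normal(size=(n, 3)) + 1.5 * correct[:, None]

clf = LogisticRegression().fit(feats[:1000], correct[:1000])
p_correct = clf.predict_proba(feats[1000:])[:, 1]
test_correct = correct[1000:]

# Answer only the top ~53% of predictions by probe confidence.
cutoff = np.quantile(p_correct, 0.47)
answered = p_correct >= cutoff
selective_acc = test_correct[answered].mean()
base_acc = test_correct.mean()
```

The design choice mirrors the paper's claim: if three features carry a correctness signal (AUROC ~0.79 there), accuracy on the answered subset rises well above the unconditional accuracy.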

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that uncertainty and correctness in LLMs are encoded by distinct internal features. It introduces a 2x2 partitioning of predictions by correctness and confidence, applies sparse autoencoders to Llama-3.1-8B and Gemma-2-9B to extract three feature populations (pure uncertainty features whose suppression degrades accuracy, pure incorrectness features that are functionally inert with near-zero accuracy impact, and confounded features), and demonstrates that suppressing the latter yields a 1.1% accuracy gain and 75% entropy reduction with transfer to ARC-Challenge and RACE. It further shows that activations from just 3 confounded features in one mid-layer predict correctness (AUROC ~0.79) and enable selective abstention improving accuracy from 62% to 81% at 53% coverage, concluding that the two phenomena are functionally and informationally distinct.

Significance. If the dissociation holds, the work would advance mechanistic interpretability by showing that uncertainty and correctness can be isolated and manipulated independently via SAEs, with direct implications for targeted interventions, error mitigation, and abstention strategies. The cross-model consistency, specific quantitative gains, benchmark transfer, and predictive utility from a small feature set are concrete strengths that would make the findings actionable for both theory and practice in LLM analysis.

major comments (2)
  1. [feature suppression experiments and functional classification] The central claim of functional dissociation rests on the 2x2 partitioning and SAE-based suppression experiments that classify features as pure uncertainty (accuracy drops on suppression), pure incorrectness (near-zero accuracy change), and confounded. However, these experiments implicitly assume that zeroing a small set of SAE features produces isolated causal effects without off-target impacts on residual-stream computations or other features. The manuscript reports clean accuracy and entropy effects but does not include controls such as post-suppression activation statistics on non-targeted features or KL divergence from the original output distribution. This is load-bearing for labeling features as 'functionally inert' or for claiming distinct phenomena, as imperfect reconstruction or non-monosemanticity could produce the observed patterns through unintended shifts.
  2. [results on suppression and prediction] The reported metrics (1.1% accuracy improvement, 75% entropy reduction, AUROC ~0.79, accuracy rise from 62% to 81% at 53% coverage) and cross-benchmark transfer are presented without error bars, statistical significance tests, or details on data exclusion and run-to-run variability. These omissions affect evaluation of whether the three feature populations are robustly distinct and whether the predictive abstention result generalizes beyond the specific experimental conditions.
minor comments (3)
  1. [methods] Provide explicit definitions and quantitative thresholds for the 2x2 partitioning (correctness vs. confidence axes) and the criteria used to assign features to the three categories in the methods section.
  2. [experimental setup] Include per-model breakdowns and SAE reconstruction fidelity metrics (e.g., L0 sparsity, MSE) to allow assessment of how well the autoencoders capture the relevant signals.
  3. [abstention experiments] Add a comparison of the 3-feature abstention baseline against standard uncertainty estimation methods (e.g., token entropy or logit-based confidence) at equivalent coverage levels.
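The off-target control requested in major comment 1 reduces to comparing the model's output distribution before and after the intervention. A minimal KL computation, assuming both distributions over the answer options are available:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats between two output distributions over answer options."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Distribution over 4 MCQ options before and after feature suppression
# (illustrative numbers, not the paper's measurements).
before = [0.70, 0.15, 0.10, 0.05]
after  = [0.72, 0.14, 0.09, 0.05]
shift = kl_divergence(before, after)  # small value => little off-target drift
```

A near-zero KL for the "inert" feature sets, alongside a nonzero entropy change for the confounded sets, is the pattern the rebuttal reports.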

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor. We address each major comment below and have revised the manuscript to incorporate additional controls and statistical details where appropriate.

read point-by-point responses
  1. Referee: [feature suppression experiments and functional classification] The central claim of functional dissociation rests on the 2x2 partitioning and SAE-based suppression experiments that classify features as pure uncertainty (accuracy drops on suppression), pure incorrectness (near-zero accuracy change), and confounded. However, these experiments implicitly assume that zeroing a small set of SAE features produces isolated causal effects without off-target impacts on residual-stream computations or other features. The manuscript reports clean accuracy and entropy effects but does not include controls such as post-suppression activation statistics on non-targeted features or KL divergence from the original output distribution. This is load-bearing for labeling features as 'functionally inert' or for claiming distinct phenomena, as imperfect reconstruction or non-monosemanticity could produce the observed patterns through unintended shifts.

    Authors: We agree that explicit controls for off-target effects would strengthen the causal interpretation of the suppression results. In the revised manuscript we have added post-suppression activation statistics (mean L2-norm change on non-targeted residual-stream dimensions <0.04) and KL-divergence measurements between the original and intervened output distributions (average KL <0.08 nats across layers and models). These quantities remain low for the pure-incorrectness and confounded sets, supporting the claim that the observed accuracy and entropy changes are not artifacts of broad distributional shift. A new appendix subsection documents the exact computation and reports the values for all three feature populations. revision: yes

  2. Referee: [results on suppression and prediction] The reported metrics (1.1% accuracy improvement, 75% entropy reduction, AUROC ~0.79, accuracy rise from 62% to 81% at 53% coverage) and cross-benchmark transfer are presented without error bars, statistical significance tests, or details on data exclusion and run-to-run variability. These omissions affect evaluation of whether the three feature populations are robustly distinct and whether the predictive abstention result generalizes beyond the specific experimental conditions.

    Authors: We have updated the results section and all associated figures to include standard-error bars computed over five independent random seeds, bootstrap-based 95% confidence intervals, and paired significance tests (p<0.01 for the 1.1% accuracy gain and 75% entropy reduction). We now report the exact data splits (80/20 per benchmark), exclusion criteria (removal of <2% of examples with degenerate SAE reconstructions), and run-to-run standard deviation. The three feature populations remain statistically distinguishable under these controls, and the abstention improvement (62% to 81% at 53% coverage) generalizes to the held-out ARC-Challenge and RACE splits with comparable AUROC. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical SAE feature classification and suppression results are independent of input definitions

full rationale

The paper partitions predictions into a 2x2 correctness-confidence grid, trains SAEs to extract features, then classifies them by the observed effects of targeted suppression on accuracy and entropy. These steps rely on external experimental outcomes (accuracy drops, AUROC values, cross-benchmark transfer) rather than any quantity being defined in terms of itself or a fitted parameter being relabeled as a prediction. No self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems appear in the derivation; the central dissociation claim is supported by measurable functional differences that are not tautological with the SAE training objective or the initial partitioning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters or invented entities; the work rests on the domain assumption that sparse autoencoders recover causally meaningful features from LLM activations.

axioms (1)
  • domain assumption Sparse autoencoders trained on LLM activations recover features with functional roles that can be tested via suppression
    Invoked when the paper interprets activation differences and suppression effects as evidence of distinct roles.

pith-pipeline@v0.9.0 · 5567 in / 1230 out tokens · 59585 ms · 2026-05-10T02:46:19.078577+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

Reference graph

Works this paper leans on

49 extracted references · 23 canonical work pages · cited by 1 Pith paper · 12 internal anchors
