Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

Bo Jiang; Jiang-Feng Yang; Qing-Wei Cong; Qin-Yuan Liu; Qiu-Yang Zhao; Yu-Hang Wu

arxiv: 2605.11416 · v2 · pith:XK5AKJGWnew · submitted 2026-05-12 · 💻 cs.CL

Yu-Hang Wu, Qin-Yuan Liu, Qiu-Yang Zhao, Bo Jiang, Jiang-Feng Yang, Qing-Wei Cong

Pith reviewed 2026-05-25 06:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords continued pre-training · layer-wise updates · LLM interpretability · LayerTracer · freeze-train strategies · deep layer stability · C-Eval · CMMLU

The pith

Deep layers execute tasks stably in LLMs, so freezing them while training shallow layers improves continued pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LayerTracer to diagnose which layers handle task execution and how stable they remain under updates. Analysis finds deep layers both critical for performance and resistant to disruptive changes. Experiments then compare freeze-train allocations during continued pre-training and show that training only shallow layers while freezing deep ones beats full-parameter updates and the reverse strategy on C-Eval and CMMLU. The approach supplies an interpretable, lower-cost method for adapting large models without erasing core capabilities.

Core claim

LayerTracer reveals that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Controlled continued pre-training trials demonstrate that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. A hybrid model case study validates that placing high-quality pre-trained modules in deep layers preserves the model's inherent knowledge.

What carries the argument

LayerTracer, an architecture-agnostic diagnostic framework that locates task execution positions and quantifies layer sensitivity by tracking representation evolution and stability.
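The summary above does not say how LayerTracer computes sensitivity, so the following is only a generic illustration of the idea of "quantifying how much a layer's output shifts". It assumes each layer's hidden state has already been projected to a probability distribution over a small vocabulary, and it uses Jensen-Shannon divergence as a stand-in measure. The function names `js_divergence` and `layer_shift_profile`, and the toy distributions, are hypothetical and are not taken from the paper.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; wherever p or q is positive, m is positive too.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def layer_shift_profile(baseline, perturbed):
    """Per-layer divergence between baseline and perturbed distributions."""
    return [js_divergence(p, q) for p, q in zip(baseline, perturbed)]


# Toy per-layer distributions (illustrative only, not data from the paper).
baseline = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
perturbed = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
profile = layer_shift_profile(baseline, perturbed)
```

In this toy profile the first layer shifts under perturbation and the second does not, which is the kind of contrast a stability analysis would look for. Whether LayerTracer uses this divergence, or a different statistic, is not stated in the material reviewed here.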

If this is right

Training shallow layers while freezing deep layers yields higher scores than full-parameter fine-tuning on knowledge benchmarks.
The opposite allocation of freezing shallow layers and training deep layers underperforms both full fine-tuning and the shallow-train strategy.
Placing high-quality pre-trained modules in deep layers preserves inherent model knowledge during hybrid construction.
Selective layer updates guided by stability analysis reduce compute cost while retaining task performance.
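The freeze-deep, train-shallow allocation can be sketched as a rule over parameter names. This is a minimal sketch under stated assumptions: parameters of block `i` are assumed to contain `.layers.<i>.` in their names, the split point defaults to the midpoint, and parameters outside any block (embeddings, final norm, output head) are frozen. The helper `trainable_flags` and the example names are hypothetical; the paper's own training code is not shown on this page.

```python
import re

# Assumed naming scheme: parameters of block i contain ".layers.<i>.".
_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def trainable_flags(param_names, num_layers, split=0.5, train="shallow"):
    """Map each parameter name to True (train) or False (freeze).

    Blocks with index < int(num_layers * split) count as shallow.
    Parameters outside any block are frozen; that choice is an
    assumption made for this sketch, not something the paper states.
    """
    if train not in ("shallow", "deep"):
        raise ValueError("train must be 'shallow' or 'deep'")
    boundary = int(num_layers * split)
    flags = {}
    for name in param_names:
        match = _LAYER_RE.search(name)
        if match is None:
            flags[name] = False
            continue
        is_shallow = int(match.group(1)) < boundary
        flags[name] = is_shallow if train == "shallow" else not is_shallow
    return flags


# Example parameter names for a hypothetical 28-block decoder.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.13.self_attn.q_proj.weight",
    "model.layers.14.self_attn.q_proj.weight",
    "model.layers.27.mlp.down_proj.weight",
    "lm_head.weight",
]
flags = trainable_flags(names, num_layers=28)
reverse_flags = trainable_flags(names, num_layers=28, train="deep")
```

With a PyTorch model, the flags could be applied by iterating over `model.named_parameters()` and setting each parameter's `requires_grad` attribute to the corresponding value. Passing `train="deep"` gives the reverse allocation that the trials compare against.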

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The observed stability gradient may reflect a broader division where shallow layers adapt features and deep layers store stable abstractions.
Teams with limited resources could apply the same diagnostic to decide layer allocation without exhaustive search.
The pattern could guide decisions when merging or distilling models by protecting deep-layer knowledge.
If the pattern holds across languages, the same allocation rule might apply to English-centric continued pre-training.

Load-bearing premise

The layer sensitivity and stability patterns identified on the analyzed models and tasks are assumed to generalize to other models, scales, and tasks.

What would settle it

A controlled continued pre-training run on a different LLM scale or non-Chinese task where the freeze-shallow train-deep allocation matches or exceeds the performance of the freeze-deep train-shallow strategy.

Figures

Figures reproduced from arXiv: 2605.11416 by Bo Jiang, Jiang-Feng Yang, Qing-Wei Cong, Qin-Yuan Liu, Qiu-Yang Zhao, Yu-Hang Wu.

**Figure 2.** The architectures of Qwen3 model and Qwen3.5 model. Qwen3 adopts a single architecture, while Qwen3.5 is a hybrid architecture with a 3:1 ratio of Full Attention to Linear Attention.

**Figure 3.** Overview of the LayerTracer framework. (a) Baseline projection: hidden states at each layer are projected

**Figure 4.** Distribution details in the AntSynNET dataset, where 500 samples are evenly divided into 10 groups of 50

**Figure 6.** Illustration of the proposed zone-strategy

**Figure 7.** The architectures of Nemotron-Qwen model

**Figure 8.** Training loss curves across three layer-wise allocation strategies on the CCI3.0-HQ corpus. (a) Train

**Figure 9.** Robustness analysis under Top-50 probability support on the AntSynNET dataset, where 500 samples

**Figure 10.** Layer-wise TP and LS profiles of Qwen3-710M-Base. The light green region denotes the shallow layers used in the midpoint split, where LS is relatively stronger and layers are more sensitive to perturbation. The light red region denotes the deep layers, where TP becomes more prominent and task evidence is mainly consolidated. The dashed line marks the 50% split used in the main experiments. The inset repor…

Original abstract

Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture-agnostic diagnostic framework that reveals the evolution patterns of layer-wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre-training trials to compare diverse freeze-train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high-quality pre-trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low-cost and interpretable solution for resource-constrained teams, offering actionable guidance for layer-wise parameter allocation in continued pre-training and hybrid model construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LayerTracer offers a diagnostic for layer allocation in continued pre-training with some supporting trials, but lacks details needed to confirm the generalization of its findings.

Hi colleague,

The key thing to know about this paper is that it introduces LayerTracer as a way to analyze layer behavior in LLMs during continued pre-training, and from that analysis recommends freezing deep layers and training shallow ones for better performance at lower cost. The authors support this with analysis showing that deep layers are stable sites of task execution, then run three trials in which this approach beats full fine-tuning on C-Eval and CMMLU.

What is actually new here is the LayerTracer diagnostic, which locates task positions and measures sensitivity in a way that does not depend on the specific architecture. The empirical results are presented as guided by this analysis rather than as guesses at allocations. The hybrid model case study adds a little by showing that placing high-quality modules in deep layers helps preserve knowledge.

The paper does a good job of addressing a real practical problem for teams without large compute budgets. It makes the choice of which layers to update more interpretable and gives a consistent recommendation backed by the trials the authors ran.

On the soft spots, the evidence is limited in ways that matter. The abstract does not mention the sizes of the models used, how many random seeds or runs were averaged, or any statistical tests for the outperformance. This makes it difficult to assess whether the results are robust or an artifact of the particular setups. The central assumption, that the deep-layer stability pattern holds more broadly, is not tested in the reported work; if it does not apply to other model scales or tasks, the allocation advice loses its foundation.

This work is mainly for practitioners who need low-cost continued pre-training methods and are looking for heuristics to try. A reader interested in efficient LLM adaptation might find the LayerTracer idea useful to build on or test independently.

Overall, the paper shows clear thinking on the problem and honest empirical engagement, so it deserves a serious referee to evaluate the full experimental details and see whether additional validation can be added.

Cheers,

Referee Report

2 major / 1 minor

Summary. The paper introduces LayerTracer, an architecture-agnostic diagnostic framework that analyzes the evolution of layer-wise representations and stability in LLMs to locate task-execution positions and quantify sensitivity. Analysis of the studied models reveals that deep layers serve as critical, high-stability regions for task execution. Guided by this, the authors run three controlled continued pre-training trials comparing freeze-train allocations and report that training shallow layers while freezing deep layers outperforms both full-parameter fine-tuning and the reverse allocation on C-Eval and CMMLU; a hybrid-model case study is also presented to show preservation of pre-trained knowledge when high-quality modules are placed in deep layers.

Significance. If the deep-layer stability pattern generalizes, the work supplies an interpretable, low-cost heuristic for layer allocation during continued pre-training that could reduce compute for resource-constrained teams while preserving model capability. The empirical comparison on two standard Chinese benchmarks and the hybrid-model validation constitute concrete, actionable guidance; explicit credit is due for the controlled-trial design that directly contrasts allocation strategies rather than relying on post-hoc analysis alone.

major comments (2)

[Abstract] Abstract and the description of the three controlled continued pre-training trials: the claim of consistent outperformance on C-Eval and CMMLU is presented without any report of the number of random seeds, run-to-run variance, statistical significance tests, model sizes, or architectures used. These omissions are load-bearing because the central allocation recommendation rests on the reliability and generality of the observed superiority.
[Abstract / Experiments] The transition from LayerTracer analysis to the recommended shallow-train/deep-freeze strategy assumes that the identified stability ranking of deep layers will hold at other scales and on other tasks; no cross-scale or cross-benchmark validation of this assumption is described, leaving the justification for the allocation strategy dependent on the specific regime examined in the LayerTracer runs.

minor comments (1)

[Abstract] The phrase 'architecture-agnostic' in the abstract would benefit from an explicit statement of the model families and layer counts to which LayerTracer was applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of the LayerTracer framework and controlled allocation trials. We respond to each major comment below.

Referee: [Abstract] Abstract and the description of the three controlled continued pre-training trials: the claim of consistent outperformance on C-Eval and CMMLU is presented without any report of the number of random seeds, run-to-run variance, statistical significance tests, model sizes, or architectures used. These omissions are load-bearing because the central allocation recommendation rests on the reliability and generality of the observed superiority.

Authors: We agree that the abstract omits these details and that they are important for assessing the reliability of the central claims. Model sizes and architectures are described in the experimental setup of the full manuscript, but the number of random seeds, run-to-run variance, and statistical significance tests are not reported. In the revised manuscript we will add these elements to both the abstract and the experimental sections, including means and standard deviations across multiple seeds together with the results of significance testing.

revision: yes

Referee: [Abstract / Experiments] The transition from LayerTracer analysis to the recommended shallow-train/deep-freeze strategy assumes that the identified stability ranking of deep layers will hold at other scales and on other tasks; no cross-scale or cross-benchmark validation of this assumption is described, leaving the justification for the allocation strategy dependent on the specific regime examined in the LayerTracer runs.

Authors: The LayerTracer analysis and the three controlled trials were performed on the specific models and benchmarks reported in the paper; the allocation recommendation follows directly from the stability patterns and performance results observed in that regime. We acknowledge that the manuscript contains no cross-scale or additional cross-benchmark experiments that would support broader generalization. In the revision we will explicitly delimit the scope of the findings to the examined models and tasks and will add a statement that validation at other scales remains future work.

revision: partial
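
The seed-level reporting discussed in the first response (means, standard deviations, and a significance test) could look roughly like the sketch below. It is an illustration only: the scores are placeholder numbers invented for the example, not results from the paper, and the exact sign-flip permutation test is one reasonable choice among several rather than the test the authors commit to. The function names `summarize` and `paired_sign_flip_p` are hypothetical.

```python
import itertools
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            count += 1
    return count / total


# Placeholder numbers for illustration only; NOT results from the paper.
shallow_train = [51.0, 52.0, 50.5, 51.5, 52.5]
full_finetune = [50.0, 50.5, 50.0, 50.5, 51.0]

mean_shallow, std_shallow = summarize(shallow_train)
p_value = paired_sign_flip_p(shallow_train, full_finetune)
```

With `n` paired seeds, the smallest two-sided p-value this exact test can return is `2 / 2**n`, so five seeds cannot go below 0.0625. That floor is one reason the number of seeds matters for the referee's first major comment.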

Circularity Check

0 steps flagged

No circularity; empirical analysis and trials are self-contained

The paper's core contribution is an empirical diagnostic (LayerTracer) that measures layer representations and stability on specific models/tasks, followed by three controlled continued pre-training experiments comparing freeze/train allocations on C-Eval and CMMLU. No equations, fitted parameters, or self-citations are invoked to derive the allocation recommendation; the superiority claim rests on direct experimental comparison rather than any reduction to inputs by construction. Generalization beyond the tested regime is an external-validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim rests on the empirical generalization of LayerTracer patterns rather than on any derived constants or new postulated objects.

[1] [1]

Qwen2.5 Technical Report

Qwen2.5 technical report , author=. arXiv preprint arXiv:2412.15115 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Qwen2 Technical Report

Qwen2 technical report , author=. arXiv preprint arXiv:2407.10671 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Qwen3 Technical Report

Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Qwen3.5 official repository , author=

work page

[5] [5]

GPT-4 Technical Report

GPT-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

LLaMA: Open and Efficient Foundation Language Models

LLaMA: Open and efficient foundation language models , author=. arXiv preprint arXiv:2302.13971 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Llama 2: Open Foundation and Fine-Tuned Chat Models

LLaMA 2: Open foundation and fine-tuned chat models , author=. arXiv preprint arXiv:2307.09288 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

The Llama 3 Herd of Models

The LLaMA 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Advances in Neural Information Processing Systems (NeurIPS) , pages=

Attention is all you need , author=. Advances in Neural Information Processing Systems (NeurIPS) , pages=

work page

[10] [10]

International Conference on Learning Representations (ICLR) , year=

Mamba: Linear-time sequence modeling with selective state spaces , author=. International Conference on Learning Representations (ICLR) , year=

work page

[11] [11]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality , author=. arXiv preprint arXiv:2405.21060 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

International Conference on Learning Representations (ICLR) , year=

Linear attention is (maybe) all you need (to understand transformer optimization) , author=. International Conference on Learning Representations (ICLR) , year=

work page

[13] [13]

Journal of Machine Learning Research , volume=

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity , author=. Journal of Machine Learning Research , volume=

work page

[14] [14]

Jamba: A Hybrid Transformer-Mamba Language Model

Jamba: A hybrid transformer-mamba language model , author=. arXiv preprint arXiv:2403.19887 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[15] Gated delta networks: Improving Mamba2 with delta rule. International Conference on Learning Representations (ICLR).

[16] Distinguishing antonyms and synonyms in a pattern-based neural network. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers.

[17] A Survey of Large Language Models. arXiv preprint arXiv:2303.18223.

[18] Abo El-Enen, Mohamed and Saad, Sally and Nazmy, Taymoor. Neural Computing and Applications. 2025.

[19] Sugar-coated poison: Benign generation unlocks LLM jailbreaking. arXiv preprint arXiv:2504.05652.

[20] Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory.

[21] Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.

[22] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models. arXiv preprint arXiv:2601.11340.

[23] Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics.

[24] Information-Theoretic Probing for Linguistic Structure. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

[25] Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics.

[26] Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained Language Models. Findings of the Association for Computational Linguistics: EMNLP 2023.

[27] Finding Neurons in a Haystack: Case Studies with Sparse Probing. Advances in Neural Information Processing Systems.

[28] Mechanistic Understanding of Alignment. International Conference on Machine Learning.

[29] How Large Language Models Encode Context Knowledge? A Layer-Wise Probing Study. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024).

[30] Analyzing LLMs’ Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representation. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics.

[31] Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models. arXiv:2601.14004.

[32] Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs. arXiv:2506.07240.

[33] Linguistic Minimal Pairs Elicit Linguistic Similarity in Large Language Models. Proceedings of the 2025 International Conference on Computational Linguistics.

[34] LLM Already Knows Difficulty. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.

[35] Transformer feed-forward layers are key-value memories. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

[36] Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems (NeurIPS).

[37] Qwen Technical Report. arXiv preprint arXiv:2309.16609.

[38] Finding neurons in a haystack: Case studies with sparse probing. Advances in Neural Information Processing Systems (NeurIPS).

[39] Similarity of neural network representations revisited. International Conference on Machine Learning (ICML).

[40] A survey on the robustness of large language models. Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP Findings).

[41] Mass editing memory in a transformer. International Conference on Learning Representations (ICLR).

[42] Linearity of relation decoding in transformer language models. International Conference on Learning Representations (ICLR).

[43] What do probes actually probe? On the role of surface statistics in linguistic probing. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

[44] Layer-wise analysis of knowledge distillation in large language models. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

[45] DISC-MedLLM: Bridging general large language models and real-world medical consultation. arXiv preprint arXiv:2308.14346.

[46] MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. arXiv preprint arXiv:2311.16079.

[47] Code Llama: Open Foundation Models for Code. arXiv preprint arXiv:2308.12950.

[48] HuatuoGPT-II, one-stage training for medical adaption of LLMs. arXiv preprint arXiv:2311.09774.

[49] BioMistral: A collection of open-source pretrained large language models for medical domains. Findings of the Association for Computational Linguistics: ACL 2024.

[50] LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.

[51] Zhang, Jia-Chen and Xiong, Yu-Jie and Xia, Chun-Ming and Zhu, Dong-Hai and Qiu, Xi-He. Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace. Proceedings of the 31st International Conference on Computational Linguistics. 2025.

[52] Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning. 2023.

[53] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.

[54] Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems.

[55] DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. Nature. 2025.

[56] Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.

[57] C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. Advances in Neural Information Processing Systems.

[58] CMMLU: Measuring massive multitask language understanding in Chinese. Findings of the Association for Computational Linguistics. 2024.

[59] CCI3.0-HQ: a large-scale Chinese dataset of high quality designed for pre-training large language models. 2024.