Recognition: unknown
A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Pith reviewed 2026-05-14 20:58 UTC · model grok-4.3
The pith
Massive activations in large language models first appear in one specific layer and stay largely fixed afterward, reducing the diversity of inputs reaching attention; a simple softening method relaxes this rigidity and improves performance on key tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Massive activations emerge at a specific layer, termed the Massive Emergence Layer. Both the layer's RMSNorm and its feed-forward network parameters contribute to their creation. The activations then spread to later layers through residual connections while keeping nearly the same values, which lowers the diversity of the inputs seen by attention. A simple softening procedure applied to the massive activation token removes some of this rigidity and produces consistent gains on instruction following and math reasoning tasks.
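As an illustration of how such a layer might be located empirically, the sketch below scans per-layer hidden states of a HuggingFace-style causal LM and flags the first layer whose largest activation magnitude dwarfs the typical magnitude. The model name, prompt, and 100x-median threshold are assumptions for illustration, not the paper's detection criterion.

```python
# Minimal sketch, assuming a HuggingFace decoder-only LM that exposes hidden states.
# The 100x-median rule and the model name are illustrative, not the paper's criterion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, d_model];
# index 0 is the embedding output, index i the output of block i.
for layer_idx, h in enumerate(out.hidden_states):
    abs_h = h.abs().float()
    top, median = abs_h.max().item(), abs_h.median().item()
    if median > 0 and top > 100 * median:
        print(f"layer {layer_idx}: max |h| = {top:.1f} ({top / median:.0f}x the median)")
        print("candidate Massive Emergence Layer (first layer crossing the threshold)")
        break
```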
What carries the argument
The Massive Emergence Layer, the point at which massive activations first form through the combined effect of RMSNorm and FFN and then remain invariant as they propagate via residuals.
Load-bearing premise
The analysis assumes that the Massive Emergence Layer is the causal origin of the massive activations and their downstream effects on representation diversity and performance, as opposed to being a correlated but non-causal observation.
What would settle it
Finding that massive activations of similar magnitude appear in layers before the proposed ME Layer, or that applying the softening method to non-ME layers produces equivalent performance gains.
read the original abstract
We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, named the Massive Emergence Layer (ME Layer), that is consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, both the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden state level and shedding new light on principled mitigation strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript identifies a consistent 'Massive Emergence Layer' (ME Layer) across LLM families where massive activations first emerge from the joint contribution of RMSNorm and FFN parameters, then propagate invariantly via residual connections, reducing the hidden-state diversity passed to attention modules. Motivated by this, the authors propose a softening method that improves performance on instruction-following and math-reasoning tasks in both training-free and fine-tuning regimes and mitigates attention sinks.
Significance. If the ME Layer is shown to be causal and the softening method acts specifically on this mechanism, the work would supply a simple, cross-family explanation for representation collapse and attention sinks together with a practical intervention. The observational consistency across families is a positive feature, yet the current descriptive evidence limits the strength of the central claims.
major comments (3)
- [Section 3.2] Section 3.2: The claim that the ME Layer is the causal origin of downstream invariance rests on observational identification and joint-parameter analysis; no targeted intervention (e.g., zeroing or scaling only the ME Layer's RMSNorm/FFN outputs while freezing all other layers) is reported to test whether emergence and invariance can be prevented or reproduced.
- [Section 4] Section 4: Performance gains from the softening method are stated without effect sizes, ablation controls against generic activation damping, or statistical significance tests, so it remains unclear whether improvements arise specifically from action on the ME mechanism rather than unrelated side effects.
- [Section 5] Section 5 / Figure 5: The invariance of massive-activation token representations is shown descriptively; no quantitative diversity metric (e.g., layer-wise cosine similarity or entropy of hidden states) is supplied to support the claim that diversity passed to attention is materially reduced.
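The third major comment asks for a quantitative diversity metric. A minimal way to operationalize it, assuming hidden states collected with output_hidden_states=True as in a standard forward pass, is the layer-wise cosine similarity of the massive-activation token's representation against its value at the layer where it first emerged; values close to 1 in later layers would quantify the claimed rigidity. The function name, layer index, and token index below are placeholders.

```python
# Illustrative metric only: per-layer cosine similarity of one token's hidden state
# against its representation at the (assumed) Massive Emergence Layer.
# `hidden_states` is the tuple from a HuggingFace forward pass with
# output_hidden_states=True; `me_layer` and `tok_idx` are placeholders.
import torch
import torch.nn.functional as F

def invariance_profile(hidden_states, me_layer: int, tok_idx: int):
    """Cosine similarity of token `tok_idx` at every layer >= `me_layer`
    relative to its state at `me_layer`; values near 1.0 indicate rigidity."""
    ref = hidden_states[me_layer][0, tok_idx].float()  # [d_model]
    return [
        F.cosine_similarity(ref, hidden_states[layer][0, tok_idx].float(), dim=0).item()
        for layer in range(me_layer, len(hidden_states))
    ]

# Example usage, given `out` from a prior forward pass:
# sims = invariance_profile(out.hidden_states, me_layer=2, tok_idx=0)
# print(["{:.3f}".format(s) for s in sims])
```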
minor comments (2)
- [Abstract] Abstract: 'Massive activations' should be defined with an explicit numerical threshold or percentile criterion at first use.
- [Method] Notation: The precise definition of the softening operation (e.g., scaling factor or replacement value) should be given in equation form rather than prose only.
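For the second minor comment, one plausible equation-level form of the softening operation, offered here only as an illustration since the paper's exact definition is not reproduced on this page, is an interpolation of the massive-activation token's hidden state toward the sequence mean:

```latex
% Illustrative form only; the authors' actual softening operation may differ.
% h_t: hidden state of the massive-activation token at the ME Layer,
% \bar{h}: mean hidden state over the sequence, \alpha: softening factor.
\tilde{h}_t = \alpha\, h_t + (1 - \alpha)\, \bar{h}, \qquad \alpha \in (0, 1).
```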
Simulated Author's Rebuttal
We thank the referee for the constructive comments and recommendation for major revision. The suggestions will help strengthen the causal evidence for the ME Layer and the empirical specificity of our softening method. We address each point below and commit to the indicated revisions.
read point-by-point responses
-
Referee: [Section 3.2] Section 3.2: The claim that the ME Layer is the causal origin of downstream invariance rests on observational identification and joint-parameter analysis; no targeted intervention (e.g., zeroing or scaling only the ME Layer's RMSNorm/FFN outputs while freezing all other layers) is reported to test whether emergence and invariance can be prevented or reproduced.
Authors: We agree that a targeted intervention would provide stronger causal support beyond the observational consistency and joint-parameter analysis currently in Section 3.2. The manuscript shows emergence via RMSNorm+FFN and invariance via residuals, but does not isolate the ME Layer through freezing. In revision we will add experiments that scale or zero only the ME Layer RMSNorm/FFN outputs (freezing all other layers) and measure whether downstream massive activations are prevented or reproduced. These results will be reported in an expanded Section 3.2. revision: yes
-
Referee: [Section 4] Section 4: Performance gains from the softening method are stated without effect sizes, ablation controls against generic activation damping, or statistical significance tests, so it remains unclear whether improvements arise specifically from action on the ME mechanism rather than unrelated side effects.
Authors: We accept this critique. The current manuscript reports improvements on instruction-following and math tasks but omits effect sizes, generic-damping ablations, and significance testing. In the revised version we will add (i) absolute effect sizes, (ii) ablations contrasting our ME-targeted softening against uniform activation damping, and (iii) paired t-tests across multiple random seeds. These additions will appear in Section 4 and will clarify that gains are attributable to the ME mechanism. revision: yes
-
Referee: [Section 5] Section 5 / Figure 5: The invariance of massive-activation token representations is shown descriptively; no quantitative diversity metric (e.g., layer-wise cosine similarity or entropy of hidden states) is supplied to support the claim that diversity passed to attention is materially reduced.
Authors: We agree that quantitative metrics are needed to move beyond the descriptive evidence in Section 5 and Figure 5. In revision we will compute and report layer-wise cosine similarity of massive-activation hidden states together with entropy of the hidden-state distribution across layers. These metrics will be added to Section 5 to quantify the reduction in representational diversity passed to attention. revision: yes
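The targeted intervention promised in the first response could be prototyped with a forward hook that rescales only the ME Layer's FFN output while every other layer is left untouched. The module path below follows a Llama-style HuggingFace layout and is an assumption, not the authors' implementation; the scaling factor is a free choice.

```python
# Sketch of the causal test discussed above: scale only the ME Layer's FFN (MLP)
# output, leave all other layers untouched, then re-run the detection sweep to see
# whether massive activations still emerge downstream.
# `model.model.layers[i].mlp` assumes a Llama-style layout; adapt as needed.
import torch

def scale_ffn_output(model, me_layer: int, scale: float):
    """Register a forward hook multiplying the ME Layer's MLP output by `scale`.
    Returns the handle so the intervention can be removed afterwards."""
    mlp = model.model.layers[me_layer].mlp

    def hook(module, inputs, output):
        return output * scale  # returning a value replaces the module's output

    return mlp.register_forward_hook(hook)

# Example usage:
# handle = scale_ffn_output(model, me_layer=2, scale=0.0)  # ablate the FFN contribution
# with torch.no_grad():
#     out = model(**inputs, output_hidden_states=True)      # repeat the layer sweep
# handle.remove()                                           # restore the original model
```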
Circularity Check
No significant circularity in observational identification and method proposal
full rationale
The paper's core claims rest on empirical observation: consistent identification of the ME Layer across model families, joint RMSNorm+FFN contribution shown via parameter analysis, post-formation invariance of token representations, and a proposed softening method that is motivated by but not mathematically derived from the observations. No equations or steps reduce a claimed prediction or first-principles result to a fitted input by construction, nor do load-bearing self-citations or ansatzes appear. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Residual connections add each layer's output to its input
- domain assumption: RMSNorm and FFN are standard sub-modules in the transformer blocks examined
invented entities (2)
- Massive Emergence Layer (ME Layer): no independent evidence
- massive activation token representation: no independent evidence