Towards Explainability of SLMs by investigating Token Level Activation

Amit Kumar Das; Amlan Chakrabarti; Rajashik Datta; Sayantani Ghosh

arxiv: 2605.22377 · v1 · pith:5E3GJS53new · submitted 2026-05-21 · 💻 cs.LG


Pith reviewed 2026-05-22 07:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords: token level activation · BERT interpretability · hidden state norms · activation buckets · semantic consolidation · model explainability · layer 8 analysis

The pith

Semantically meaningful content words dominate high-activation tokens at BERT layer 8.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes measuring token importance in BERT by the L2 norm of their hidden states at layer 8. It shows that content words with real meaning land in the high-activation group much more than supportive tokens like punctuation. The findings point to layer 8 as a place where the model gathers semantic information. The method gives a straightforward way to see which tokens drive the model's internal representations without relying on attention weights.

Core claim

By calculating Token Activation Strength as the L2 norm of layer-8 hidden representations and grouping tokens into HIGH and LOW activation buckets via an upper-quartile threshold, the Activation Flow Network reveals that semantically meaningful content words consistently occupy the HIGH-activation bucket and drive most representational shifts, positioning layer 8 as a critical semantic consolidation zone.
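The computation described in the core claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the layer-8 hidden states are already available as an array of shape (num_tokens, hidden_dim), the function names `token_activation_strength` and `bucket_tokens` are hypothetical, and the random matrix only stands in for real BERT output.

```python
import numpy as np

def token_activation_strength(hidden):
    """L2 norm of each token's hidden-state vector; hidden is (num_tokens, hidden_dim)."""
    return np.linalg.norm(hidden, axis=-1)

def bucket_tokens(strengths, quantile=0.75):
    """Label a token HIGH if its strength exceeds the upper-quartile threshold, else LOW."""
    tau = np.quantile(strengths, quantile)
    labels = np.where(strengths > tau, "HIGH", "LOW")
    return tau, labels

# Synthetic stand-in for layer-8 hidden states (8 tokens, 768 dims); not real BERT output.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 768)) * np.linspace(0.5, 2.0, 8)[:, None]

strengths = token_activation_strength(hidden)
tau, labels = bucket_tokens(strengths)
for s, lab in zip(strengths, labels):
    print(f"{s:8.2f}  {lab}")
```

Because the threshold is a quantile of the same strengths, roughly a quarter of the tokens are labeled HIGH whatever the input is; only the identity of those tokens carries information.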

What carries the argument

Activation Flow Network (AFN) framework that uses L2 norm of layer-8 hidden states to quantify and bucket token activation strengths.

If this is right

Semantically meaningful content words occupy the high-activation bucket and dominate activation shifts.
Structurally supportive tokens contribute less to representational activation.
Layer 8 serves as a semantic consolidation zone balancing structure and meaning.
This provides a computationally efficient alternative to attention-based interpretability methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying the same activation bucket approach to other layers or models might reveal different consolidation points for syntax versus semantics.
High-activation tokens could be used to guide interventions like adversarial attacks or explanations in downstream tasks.
If the pattern holds, it suggests middle layers in transformers prioritize meaning extraction after early structural processing.
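The first extension above, repeating the bucket analysis at other layers, could look roughly like the sketch below. It assumes hidden states for all layers are stacked into one array of shape (num_layers, num_tokens, hidden_dim) and that a boolean content-word mask is available from some external source; both inputs are synthetic here, and the helper names are hypothetical.

```python
import numpy as np

def high_bucket_content_share(hidden, is_content, quantile=0.75):
    """Fraction of HIGH-bucket tokens that are content words, for one layer."""
    strengths = np.linalg.norm(hidden, axis=-1)
    tau = np.quantile(strengths, quantile)
    high = strengths > tau
    if not high.any():
        return float("nan")
    return float(is_content[high].mean())

def sweep_layers(all_layers, is_content, quantile=0.75):
    """Apply the same bucket analysis to every layer in turn."""
    return [high_bucket_content_share(h, is_content, quantile) for h in all_layers]

# Synthetic inputs: 13 slices (embeddings + 12 encoder layers, as in BERT-base).
rng = np.random.default_rng(1)
num_layers, num_tokens, dim = 13, 40, 64
all_layers = rng.normal(size=(num_layers, num_tokens, dim))
is_content = rng.random(num_tokens) < 0.5  # hypothetical content-word mask

shares = sweep_layers(all_layers, is_content)
for layer, share in enumerate(shares):
    print(f"layer {layer:2d}: content share of HIGH bucket = {share:.2f}")
```

With random inputs the shares hover around the base rate; a real consolidation point would show up as a layer whose share rises clearly above its neighbors.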

Load-bearing premise

The L2 norm of hidden-state vectors at a single fixed layer directly measures semantic salience.

What would settle it

Observing no correlation between high-activation tokens and human judgments of semantic importance in a controlled annotation study.
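A test of this kind could be scored with a rank correlation between activation strengths and human ratings. The sketch below is one possible setup under stated assumptions: the two arrays are made-up numbers, not data from the paper, and the hand-rolled `spearman` helper assumes no tied values (ties would need average ranks).

```python
import numpy as np

def ranks(x):
    """Ranks 0..n-1 of the values in x; assumes no tied values."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Hypothetical per-token values: activation strength and mean human salience rating.
strengths = np.array([12.1, 3.4, 9.8, 2.2, 11.0, 4.1])
human_salience = np.array([4.6, 1.1, 4.9, 1.5, 3.8, 2.0])

rho = spearman(strengths, human_salience)
print(f"Spearman rho = {rho:.3f}")
```

A value of rho near zero on a properly annotated sample would count against the load-bearing premise; a clearly positive value would support it.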

Original abstract

Transformer-based language models such as BERT having 110M+ parameters have revolutionized natural language understanding, yet their internal mechanisms remain largely opaque to researchers and practitioners. Traditional attention-based interpretability methods often emphasize structurally important but semantically weak tokens such as punctuation marks rather than meaningful semantic relationships. This work introduces a lightweight and model-agnostic framework for quantifying token-level representational importance using hidden-state activation strengths at Layer 8 of BERT. The proposed Activation Flow Network (AFN) framework computes Token Activation Strength using the L2 norm of Layer-8 hidden representations, enabling direct ranking of semantically salient tokens. The study further introduces a threshold-based activation bucket formulation that partitions tokens into HIGH-activation and LOW-activation groups using an empirical upper-quartile activation boundary. Experimental observations demonstrate that semantically meaningful content words consistently occupy the HIGH-activation bucket and dominate representational activation shifts, while structurally supportive tokens contribute comparatively less. The results suggest that Layer 8 acts as a critical semantic consolidation zone balancing structural and semantic information processing. By revealing how activation magnitudes concentrate around semantically informative tokens, this work provides an interpretable and computationally efficient alternative to attention-centric analysis, contributing toward transforming BERT from a "black box" into a more transparent "glass box" model for natural language understanding.
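For readers who want to reproduce the ranking step on a real model, the sketch below shows one way to pull layer-8 hidden states with the Hugging Face `transformers` library. It is an illustration under assumptions, not the authors' pipeline: the checkpoint name `bert-base-uncased` is a guess based on the 110M parameter count, and the helper names `layer_hidden_states` and `rank_tokens` are hypothetical. The model imports are local to the first function so the ranking helper can be tried without downloading anything.

```python
import numpy as np

def layer_hidden_states(text, layer=8, model_name="bert-base-uncased"):
    """Return (tokens, hidden) for one sentence; hidden is (num_tokens, hidden_dim).

    model_name is an assumed checkpoint; the paper's exact model is not stated here.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the embedding output, so index `layer` is that encoder layer's output.
    hidden = outputs.hidden_states[layer][0].numpy()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, hidden

def rank_tokens(tokens, hidden):
    """Sort tokens by the L2 norm of their hidden-state vectors, largest first."""
    strengths = np.linalg.norm(hidden, axis=-1)
    order = np.argsort(-strengths)
    return [(tokens[i], float(strengths[i])) for i in order]

# Example call (downloads the model on first use):
# tokens, hidden = layer_hidden_states("The cat sat on the mat.")
# for tok, s in rank_tokens(tokens, hidden):
#     print(f"{tok:>8s}  {s:.2f}")
```

Note that the token list includes the special tokens `[CLS]` and `[SEP]`; whether to keep them in the ranking is a choice the abstract does not specify.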

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Layer-8 L2 norms give a simple way to bucket semantic tokens in BERT but the evidence is observational and the magnitude interpretation needs anchoring.

The paper's main move is to define token importance via the L2 norm of hidden states at BERT layer 8, rank them, and split into HIGH and LOW buckets with an upper-quartile threshold. They report that content words land in the HIGH bucket and drive most activation shifts while punctuation and function words stay low, and they frame layer 8 as a semantic consolidation point. This is packaged as the Activation Flow Network (AFN) and positioned as a lightweight alternative to attention analysis.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Activation Flow Network (AFN) framework to quantify token-level representational importance in BERT using the L2 norm of hidden-state activations specifically at Layer 8. Tokens are partitioned into HIGH- and LOW-activation buckets via an empirical upper-quartile threshold on these norms. The central empirical claim is that semantically meaningful content words consistently occupy the HIGH bucket and dominate representational shifts, while Layer 8 functions as a semantic consolidation zone; this is positioned as a lightweight, model-agnostic alternative to attention-centric interpretability.

Significance. If the observations can be placed on firmer methodological footing, the work would supply a simple activation-magnitude heuristic for surfacing semantically salient tokens. This could complement attention visualizations in explainability research, particularly for practitioners seeking computationally cheap diagnostics, though its incremental value would depend on explicit comparisons to prior activation- or gradient-based methods.

major comments (2)

[Abstract] The claim that 'semantically meaningful content words consistently occupy the HIGH-activation bucket' and that 'Layer 8 acts as a critical semantic consolidation zone' is presented without any dataset description, number of examples, error bars, statistical tests, or ablation of the quartile threshold and layer index. The supporting data processing therefore remains unshown, leaving the dominance observation without a clear robustness anchor.
[Abstract] (Token Activation Strength and activation bucket formulation) The HIGH/LOW partition is defined directly from the L2 norms of the Layer-8 hidden states whose semantic content is then asserted. This introduces circularity because the upper-quartile boundary, by construction, labels the largest-magnitude tokens as HIGH; an independent anchor (human salience judgments, task-performance correlation, or comparison to normalized vectors) is required to substantiate that the bucket reflects semantic salience rather than magnitude artifacts.

minor comments (2)

The invented entity 'Activation Flow Network (AFN)' is introduced without a formal definition, pseudocode, or diagram showing how the token-ranking and bucket steps compose into a network; a methods section or figure would clarify the framework.
Free parameters (Layer index = 8, upper-quartile boundary) are stated without sensitivity analysis; even a brief ablation table would strengthen the claim that the observed pattern is not an artifact of these choices.
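The sensitivity analysis requested here could start from something as small as the sketch below, which recomputes the content-word share of the HIGH bucket at several quantile cutoffs. Everything in it is illustrative: the strengths and the content-word labels are synthetic, the helper name `content_share_in_high` is hypothetical, and the chosen cutoffs are arbitrary.

```python
import numpy as np

def content_share_in_high(strengths, is_content, quantile):
    """Fraction of tokens above the quantile threshold that are content words."""
    tau = np.quantile(strengths, quantile)
    high = strengths > tau
    return float(is_content[high].mean()) if high.any() else float("nan")

# Synthetic data: 200 tokens, half labeled content, with content tokens shifted upward.
rng = np.random.default_rng(2)
is_content = np.arange(200) % 2 == 0
strengths = rng.gamma(shape=5.0, scale=1.0, size=200) + 2.0 * is_content

table = {q: content_share_in_high(strengths, is_content, q) for q in (0.5, 0.6, 0.75, 0.9)}
for q, share in table.items():
    print(f"quantile {q:.2f}: content share of HIGH bucket = {share:.2f}")
```

If the share stays well above the base rate across cutoffs, the pattern is less likely to be an artifact of the upper-quartile choice; the same loop can be nested over layer indices.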

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, indicating the revisions made to enhance the paper's methodological transparency and robustness.

Referee: [Abstract] The claim that 'semantically meaningful content words consistently occupy the HIGH-activation bucket' and that 'Layer 8 acts as a critical semantic consolidation zone' is presented without any dataset description, number of examples, error bars, statistical tests, or ablation of the quartile threshold and layer index. The supporting data processing therefore remains unshown, leaving the dominance observation without a clear robustness anchor.

Authors: We agree with the referee that the abstract would benefit from more explicit references to the supporting experimental details. The full paper describes the datasets employed, the scale of the analysis in terms of number of examples, includes error bars and statistical tests for the dominance observations, and provides ablations for the quartile threshold and layer index choice. To address this comment, we have revised the abstract to incorporate concise mentions of these elements, thereby providing a clearer robustness anchor for the central claims.
revision: yes

Referee: [Abstract] (Token Activation Strength and activation bucket formulation) The HIGH/LOW partition is defined directly from the L2 norms of the Layer-8 hidden states whose semantic content is then asserted. This introduces circularity because the upper-quartile boundary, by construction, labels the largest-magnitude tokens as HIGH; an independent anchor (human salience judgments, task-performance correlation, or comparison to normalized vectors) is required to substantiate that the bucket reflects semantic salience rather than magnitude artifacts.

Authors: We thank the referee for pointing out this important methodological consideration regarding circularity. Although the HIGH/LOW buckets are defined using L2 norm magnitudes at Layer 8, the semantic content of tokens is determined independently through linguistic analysis such as part-of-speech tagging. The observation is that content words tend to exhibit higher activation strengths. To further substantiate this and provide an independent anchor as suggested, we have added comparisons to normalized activation vectors and correlations with task performance metrics in the revised manuscript. This helps demonstrate that the buckets capture semantic salience rather than purely magnitude-based artifacts.
revision: yes
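The independent check the authors describe, labeling content words by part of speech and then comparing against the buckets, can be sketched as follows. This is a toy illustration only: the sentence, the tags, and the strength values are made up, and treating NOUN, PROPN, VERB, ADJ, and ADV as "content" is an assumption about how the paper draws the line.

```python
import numpy as np

# Assumed definition of "content word" using Universal POS tag names.
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}

tokens = ["the", "quick", "fox", "jumps", "over", "the", "lazy", "dog", "."]
pos = ["DET", "ADJ", "NOUN", "VERB", "ADP", "DET", "ADJ", "NOUN", "PUNCT"]
strengths = np.array([3.1, 7.9, 9.4, 8.0, 4.0, 3.0, 7.2, 9.1, 2.5])  # made-up values

is_content = np.array([p in CONTENT_POS for p in pos])
tau = np.quantile(strengths, 0.75)
high = strengths > tau

base_rate = float(is_content.mean())        # share of content words overall
high_rate = float(is_content[high].mean())  # share of content words in the HIGH bucket

print(f"base rate of content words: {base_rate:.2f}")
print(f"content share in HIGH:      {high_rate:.2f}")
print("HIGH tokens:", [t for t, h in zip(tokens, high) if h])
```

The comparison that matters is `high_rate` against `base_rate`: the labels come from the tagger, not from the norms, so a gap between the two is an empirical finding rather than a consequence of the bucket definition.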

Circularity Check

0 steps flagged

No significant circularity; empirical observation of activation patterns is independent of bucket definition

The paper defines Token Activation Strength directly as the L2 norm of Layer-8 hidden states and creates HIGH/LOW buckets via an empirical upper-quartile threshold on those same values. It then reports the separate experimental finding that content words (identified independently, e.g., via linguistic categories) tend to fall into the HIGH bucket. This is a measurement followed by an observation, not a derivation in which a claimed result is forced by construction from the inputs. No equations, self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The framework is presented as an analysis tool whose outputs can be checked against external linguistic labels, keeping the central claims falsifiable and non-tautological.
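One way to make "checked against external linguistic labels" concrete is a permutation test: the size of the HIGH bucket is fixed by the quantile, but its composition is not, so shuffling the labels gives a chance baseline. The sketch below is an assumed procedure rather than anything described in the paper, and the input arrays are synthetic.

```python
import numpy as np

def permutation_p_value(strengths, is_content, quantile=0.75, n_perm=2000, seed=0):
    """One-sided p-value for the content share of the HIGH bucket under label shuffling."""
    rng = np.random.default_rng(seed)
    high = strengths > np.quantile(strengths, quantile)
    observed = float(is_content[high].mean())
    count = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(is_content)
        if shuffled[high].mean() >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (count + 1) / (n_perm + 1)

# Synthetic example: the 50 largest-strength tokens are exactly the content words.
strengths = np.arange(100, dtype=float)
is_content = np.arange(100) >= 50

observed, p = permutation_p_value(strengths, is_content)
print(f"observed content share in HIGH = {observed:.2f}, p = {p:.4f}")
```

A small p-value says the HIGH bucket contains more content words than random labeling would produce; it does not by itself show that vector magnitude measures semantic importance.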

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on two empirical modeling choices and one domain assumption that are introduced without independent evidence: selection of layer 8, the upper-quartile cutoff, and the premise that vector magnitude equals semantic importance. No new physical entities are postulated.

free parameters (2)

Layer index: Fixed at layer 8 as the semantic consolidation zone; chosen empirically rather than derived.
Upper-quartile activation boundary: Empirical threshold used to define HIGH versus LOW buckets; directly determines which tokens are labeled semantically salient.

axioms (1)

domain assumption: The L2 norm of a token's hidden-state vector at layer 8 quantifies its representational importance for semantic content.
Invoked in the definition of Token Activation Strength without external validation or proof.

invented entities (1)

Activation Flow Network (AFN) · no independent evidence
purpose: Framework for computing and bucketing token activation strengths.
Newly named construct introduced to organize the L2-norm and quartile procedure.

pith-pipeline@v0.9.0 · 5760 in / 1548 out tokens · 40388 ms · 2026-05-22T07:54:31.698763+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/DimensionForcing.lean · reality_from_one_distinction (8-tick period derivation) · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Target Layer Selection: Based on the hypothesis that Layer 8 serves as a consolidation point for linguistic information... this layer was exclusively selected"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (J-cost uniqueness) · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "S_i = ||h_i||_2 ... τ = Q_0.75(S) ... HIGH if S_i > τ"
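Written out in full, the quoted fragment appears to correspond to the definitions below. This is a reading of the passage, not a quotation: the superscript marking layer 8, the set notation for S, and the explicit LOW case are notational assumptions added for clarity, and n denotes the number of tokens.

```latex
S_i = \left\lVert h_i^{(8)} \right\rVert_2, \qquad
\tau = Q_{0.75}\bigl(\{S_1, \dots, S_n\}\bigr), \qquad
\mathrm{bucket}(i) =
\begin{cases}
\text{HIGH} & \text{if } S_i > \tau, \\
\text{LOW}  & \text{otherwise.}
\end{cases}
```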

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2019.
[2] Attention Is Not Explanation. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
[3] Jawahar, Ganesh and Gr. What Does. Proceedings of the 36th International Conference on Machine Learning (ICML).
[4] Syntactic Structure from Deep Learning Models. Annual Review of Linguistics.
[5] Identifying and Understanding Massive Activations in Transformers. arXiv preprint arXiv:1910.05435.
[6] Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis. arXiv preprint arXiv:2510.03366.
