
arxiv: 2604.05090 · v2 · submitted 2026-04-06 · 💻 cs.CL · cs.LG

Multilingual Language Models Encode Script Over Linguistic Structure

Pith reviewed 2026-05-10 18:58 UTC · model grok-4.3

keywords: multilingual language models, orthography, script, linguistic structure, LAPE metric, sparse autoencoders, surface form, representations

The pith

Multilingual language models organize their internal representations around surface script and orthography rather than abstract linguistic structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multilingual language models process many languages inside one shared parameter space, raising the question of whether they group inputs by underlying grammar or by visible surface cues such as script. The authors measure this by tracking language-associated units with the Language Activation Probability Entropy metric and by decomposing activations through sparse autoencoders. They observe that romanizing text produces nearly separate sets of active units that match neither the original script nor English, whereas scrambling word order leaves unit identity largely unchanged. Typological features grow detectable only in deeper layers, yet the units that actually steer generation remain the ones stable across surface changes. This pattern indicates the models maintain surface-form distinctions rather than collapsing languages into a single abstract interlingua.
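The word-order perturbation described above can be illustrated with a minimal sketch. It assumes whitespace-separated words and a seeded shuffle; the function name `shuffle_words` and the sample sentence are hypothetical, and the paper's actual shuffling procedure is not specified on this page.

```python
import random


def shuffle_words(text, seed=0):
    """Return the same words in a random order (whitespace tokenization assumed)."""
    words = text.split()
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    rng.shuffle(words)
    return " ".join(words)


original = "multilingual models share one parameter space"
perturbed = shuffle_words(original, seed=1)

# The multiset of words is unchanged; only their order differs.
same_words = sorted(perturbed.split()) == sorted(original.split())
```

Because the words themselves are untouched, this kind of perturbation changes syntactic structure while leaving the vocabulary of the input intact. Scripts written without spaces would need a different segmentation step.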

Core claim

Multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua. Language-associated units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate generation is most sensitive to units invariant to surface-form perturbations.

What carries the argument

The Language Activation Probability Entropy (LAPE) metric that scores how selectively units activate for particular languages, together with sparse autoencoder decompositions that isolate distinct activation patterns.
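As a rough illustration of a LAPE-style score, the sketch below assumes a precomputed table of per-language activation probabilities (the fraction of tokens on which a unit fires). The function names, the toy numbers, and the rule of keeping the lowest-entropy units are assumptions for illustration; the paper's exact estimation and selection thresholds are not given on this page.

```python
import math


def lape_scores(act_prob):
    """act_prob[l][u]: activation probability of unit u on language l.

    Returns one entropy per unit, computed over the unit's activation
    probabilities normalized across languages. Lower entropy means the
    unit is more selective for a few languages.
    """
    n_langs = len(act_prob)
    n_units = len(act_prob[0])
    scores = []
    for u in range(n_units):
        column = [act_prob[l][u] for l in range(n_langs)]
        total = sum(column)
        if total == 0:
            # Assumption: a unit that never fires is not language-associated.
            scores.append(float("inf"))
            continue
        dist = [p / total for p in column]
        scores.append(-sum(p * math.log(p) for p in dist if p > 0))
    return scores


def select_units(act_prob, k):
    """Indices of the k units with the lowest entropy (hypothetical selection rule)."""
    scores = lape_scores(act_prob)
    return sorted(range(len(scores)), key=lambda u: scores[u])[:k]


# Toy table: 3 languages (rows) x 3 units (columns); values are made up.
act_prob = [
    [0.90, 0.30, 0.00],
    [0.00, 0.30, 0.00],
    [0.00, 0.30, 0.00],
]
scores = lape_scores(act_prob)
selected = select_units(act_prob, k=1)
```

In the toy table, the first unit fires for only one language and so gets the lowest entropy, while the second unit fires equally for all three and gets the maximum entropy for three languages.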

Load-bearing premise

The LAPE metric and sparse autoencoder decompositions accurately isolate orthographic effects from linguistic structure without introducing analysis artifacts, and the chosen perturbations separate surface form from deeper linguistic properties.

What would settle it

Finding that romanized inputs activate the same units as native-script versions or that word-order shuffling substantially alters unit identities would falsify the claim that surface form dominates representation organization.
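The overlap comparisons in this test are reported in the figures as Jaccard similarity between sets of language-associated units. A minimal sketch follows, assuming units are identified by hypothetical `(layer, index)` pairs; the unit sets are invented purely to show the arithmetic and are not results from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two collections of unit ids."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical unit ids as (layer, index) pairs.
native = {(3, 17), (3, 42), (9, 5), (12, 88)}
romanized = {(3, 42), (10, 7), (11, 64), (12, 90)}
shuffled = {(3, 17), (3, 42), (9, 5), (12, 91)}

j_roman = jaccard(native, romanized)   # 1 shared unit out of 7 distinct units
j_shuffle = jaccard(native, shuffled)  # 3 shared units out of 5 distinct units
```

Under the paper's claim, the romanized comparison would sit near zero and the shuffled comparison would stay high; the falsifying outcome described above is the reverse pattern.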

Figures

Figures reproduced from arXiv: 2604.05090 by Aastha A K Verma, Anwoy Chatterjee, Mehak Gupta, Tanmoy Chakraborty.

Figure 2: Jaccard similarity between Romanized and …
Figure 3: Layer-wise alignment between language-associated units for Native and Romanized inputs in Llama-3.2-1B (see …
Figure 4: Jaccard similarity between language-associated units identified from original and word-shuffled text in Llama-3.2-1B (see …
Figure 5: Average family-wise probing R² scores across neuron subsets induced by romanization in Llama-3.2-1B (raw neurons). Neurons overlapping between native and romanized inputs exhibit the strongest typological alignment, while script-specific subsets encode weaker signal. Baseline denotes probing over the pooled set of all neurons that were selected for either native or romanized inputs (across all layers), se…
Figure 7: Caption not recovered. Plot: average R² for Phonology, Syntax, and Genealogy (Family) across the subsets Only Original, Overlap, Only Shuffled, and Baseline.
    Preceding text in source: "Figure 7 shows that genealogical properties are …"
Figure 8: Jaccard similarity between language-associated units identified from Romanized inputs and those from Native-script or English inputs in Gemma-2-2B. Results are shown for both raw neurons and SAE features. Romanized inputs exhibit low overlap with their native-script counterparts and near-zero overlap with English in both representations, indicating limited cross-script alignment without convergence to Eng…
Figure 9: Aggregate degree-3 Venn diagrams of language-specific neurons under orthographic variation for …
Figure 10: Aggregate degree-3 Venn diagrams of language-specific neurons under orthographic variation for …
Figure 12: Activation probability and entropy distri…
Figure 13: Mean activation statistics across languages for native and romanized inputs, for the raw MLP LAPE …
Figure 15: Average family-wise maximum probing R² scores across neuron subsets induced by romanization (Llama-3.2-1B, SAE). Overall probing scores are lower, but overlap neurons remain dominant.
    Following text in source: … across all model and representation configurations. Aggregate Neuron Overlap Under Shuffling. We first examine neuron overlap between features identified from original and word-shuffled inputs, aggregated across all language…
Figure 16: Average family-wise maximum probing R² scores across neuron subsets induced by romanization (Gemma-2-2B, raw MLP). Scores are closer across subsets, with native-only neurons occasionally falling below baseline.
    Plot: Avg. Mean Probing Score (R²) for Phonology, Syntax, and Genealogy (Family) across the subsets Only Native, Overlap, Only Romanized, and Baseline.
Figure 17: Average family-wise maximum probing R² scores across neuron subsets induced by romanization (Gemma-2-2B, SAE). Overlap neurons continue to show stronger typological alignment despite increased sparsity.
    Following text in source: … observed at the language level in the main text also holds when aggregating across neurons. For Gemma SAE, the absolute number of identified neurons is small for certain languages, making low-degree over…
Figure 20: Activation entropy and selection probability distributions for original and shuffled inputs in Llama-3.2-1B. Top: raw MLP; Bottom: SAE. The near-identical distributions indicate minimal distributional shift under shuffling.
    Following text in source: … average family-wise maximum probing R² score across neurons for the three typological feature families used throughout the paper: fam, syntax, and phonology. All plots report mean val…
Figure 21: Activation entropy and selection probability …
Figure 22: Mean activation entropy and selection probability across languages before and after shuffling. Top: …
Figure 24: Average family-wise maximum probing R² scores across neuron subsets under shuffling (Llama-3.2-1B, SAE). Condition-specific subsets dominate overlap neurons; baseline scores remain lowest.
    Following text in source: Regression Setup. Probing is formulated as a set of univariate regression problems. For each neuron or feature $n \in N_\ell$ and each typological dimension $f$, we fit a linear model across languages: $y_f^{(k)} = \beta_{n,f}\, x_n^{(k)} + \epsilon$…
Figure 25: Average family-wise maximum probing R² scores across neuron subsets under shuffling (Gemma-2-2B, raw MLP). Typological alignment is similar across normal-only, shuffled-only, and overlap subsets.
    Plot: Avg. Mean Probing Score (R²) for Phonology, Syntax, and Genealogy (Family) across the subsets Only Normal, Overlap, Only Shuffled, and Baseline.
Figure 27: Layerwise probing performance in Llama-3.2-1B. Top: Raw MLP activations. Bottom: SAE features. SAE representations are comparatively stronger in early layers, while raw activations dominate in later layers.
    Following text in source: F.2 Detailed Layerwise Probing Comparisons. Here we provide a detailed layerwise analysis of probing results for the three typological feature families used in the final experiments: fam, syntax, and p…
Figure 28: Raw minus SAE probing score differences for Llama-3.2-1B. Negative values in shallow layers indicate higher SAE informativeness, while the gradual shift toward positive values reflects increasing raw dominance with depth.
    Following text in source: Raw vs. SAE representations in Gemma. The corresponding Gemma plots are shown in Figures 29 and 30. Unlike Llama, Gemma exhibits a more stable relationship between raw and SAE represen…
Figure 29: Layerwise probing performance in Gemma-2-2B. Top: Raw MLP activations. Bottom: SAE features. Raw representations dominate for fam and syntax, while SAE features retain stronger phonological signals across layers.
    Plot residue after the caption: difference in average max R² (Raw - SAE) by layer for Gemma-2-2B, with series Genealogy (Family), Syntax, and Phonology.
Figure 30: Raw minus SAE probing score differences for Gemma-2-2B. Differences are stable across depth: positive for fam and syntax, and negative for phonology.
    Following text in source: Summary. These detailed comparisons show that sparse autoencoding reshapes typological structure in a depth-, model-, and feature-dependent manner. Llama SAEs transiently enhance early-layer typological accessibility, Gemma SAEs selectively favor Phonology…
Figure 31: Cross-model comparison of probing perfor…
Figure 33: Qualitative examples of model behavior under shuffling-based overlap ablation (Gemma-2-2B, raw).
Figure 35: Jaccard similarity between romanized and …
Figure 36: Layer-wise alignment in Llama-3-8B. Mid…
Figure 37: Layer-wise alignment in Gemma-2-9B, showing consistent representational separation in raw neurons across depth.
    Following text in source: H.3 Structural Robustness at Scale. Finally, we examine whether larger models maintain the high robustness to structural (word-order) perturbations observed in 1B and 2B models. High Overlap Under Shuffling. As shown in …
Figure 38: Jaccard similarity between units from origi…
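The univariate probing setup quoted in the text captured with Figure 24 (one linear fit per unit and typological dimension, scored by R² and then maximized over units) can be sketched roughly as follows. The intercept term, the function names, and the toy values are assumptions for illustration; the source formula is truncated, so the paper's exact specification may differ.

```python
def univariate_r2(x, y):
    """R² of a least-squares fit y ≈ alpha + beta * x across languages.

    x: one unit's mean activation per language.
    y: one typological feature value per language.
    The intercept alpha is an assumption of this sketch.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant input or target: nothing to explain
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_res = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / syy


def max_r2_over_units(unit_activations, y):
    """Best single-unit R² for one typological dimension."""
    return max(univariate_r2(x, y) for x in unit_activations.values())


# Toy data for four languages; all numbers are invented.
feature = [1.0, 3.0, 5.0, 7.0]
units = {
    "unit_a": [0.0, 1.0, 2.0, 3.0],  # perfectly linear in the feature
    "unit_b": [1.0, 1.0, 1.0, 1.0],  # constant, carries no signal
}
r2_a = univariate_r2(units["unit_a"], feature)
r2_b = univariate_r2(units["unit_b"], feature)
best = max_r2_over_units(units, feature)
```

Averaging such maxima within a feature family (phonology, syntax, genealogy) would give a family-wise score comparable in spirit to the ones plotted in the figures.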
Original abstract

Multilingual language models (LMs) organize representations for typologically and orthographically diverse languages into a shared parameter space, yet the nature of this internal organization remains elusive. In this work, we investigate which linguistic properties - abstract language identity or surface-form cues - shape multilingual representations. To do so, we analyze language-associated units across different model families and scales using the Language Activation Probability Entropy (LAPE) metric, and further decompose activations with Sparse Autoencoders. We find that these units are strongly conditioned on orthography: romanization induces near-disjoint representations that align with neither native-script inputs nor English, while word-order shuffling has limited effect on unit identity. Probing shows that typological structure becomes increasingly accessible in deeper layers, while causal interventions indicate that generation is most sensitive to units that are invariant to surface-form perturbations rather than to units identified by typological alignment alone. Overall, our results suggest that multilingual LMs organize representations around surface form, with linguistic abstraction emerging gradually without collapsing into a unified interlingua.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper claims that multilingual language models organize internal representations primarily around surface-form cues such as orthography and script rather than abstract linguistic structure or a unified interlingua. Using the Language Activation Probability Entropy (LAPE) metric to identify language-associated units, sparse autoencoder decompositions, probing for typological features, and causal interventions, the authors show that romanization produces near-disjoint representations aligning with neither native-script nor English inputs, while word-order shuffling has limited effect. Typological structure becomes more accessible in deeper layers, and generation is most sensitive to surface-invariant units.

Significance. If the results hold after addressing potential confounds, this would meaningfully advance understanding of multilingual representation geometry by demonstrating that surface form dominates over linguistic abstraction. The multi-method design (LAPE, SAEs, probing, interventions) across model families and scales, combined with falsifiable perturbation tests, provides a solid empirical foundation, and the authors deserve credit for using defined metrics rather than purely correlational analysis.

major comments (3)
  1. [§4] §4 (Romanization and word-order perturbation results): The central claim that language-associated units (via LAPE) are conditioned on orthography rests on romanization producing disjoint representations. However, since multilingual LMs use BPE-style subword tokenizers, romanization alters token boundaries and vocabulary overlap independently of script. Without controls comparing to tokenization-preserving script changes or tokenizer-matched baselines, this risks confounding orthographic encoding with tokenizer artifacts, weakening the evidence that representations organize around surface form over linguistic structure.
  2. [§3.2 and §5] §3.2 and §5 (LAPE metric and SAE decompositions): The assumption that LAPE and SAE features cleanly isolate orthographic effects from analysis artifacts is load-bearing for the claim of script-over-linguistic organization. The manuscript provides limited validation (e.g., no hyperparameter ablations for SAEs or checks that LAPE units are not proxying token identity), so it remains possible that the observed conditioning reflects tokenizer behavior rather than model-internal script encoding.
  3. [§6] §6 (Causal interventions): The finding that generation is most sensitive to surface-invariant units (rather than typologically aligned ones) requires more detail on intervention implementation, baseline comparisons, and statistical tests. Without these, it is unclear whether the differential sensitivity securely supports the gradual-emergence conclusion over alternative explanations.
minor comments (3)
  1. [§3] The LAPE definition and computation details (e.g., exact probability estimation across languages) could be expanded in §3 for reproducibility.
  2. [Figures] Figures showing representation overlaps or LAPE distributions would benefit from explicit legends, error bars, and statistical annotations.
  3. [Related Work] Add discussion of related work on tokenization effects in multilingual models to better contextualize the romanization results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify potential confounds and strengthen the empirical claims. We address each major point below. Where revisions are needed, we have incorporated additional controls, ablations, and details into the manuscript.

  1. Referee: [§4] §4 (Romanization and word-order perturbation results): The central claim that language-associated units (via LAPE) are conditioned on orthography rests on romanization producing disjoint representations. However, since multilingual LMs use BPE-style subword tokenizers, romanization alters token boundaries and vocabulary overlap independently of script. Without controls comparing to tokenization-preserving script changes or tokenizer-matched baselines, this risks confounding orthographic encoding with tokenizer artifacts, weakening the evidence that representations organize around surface form over linguistic structure.

    Authors: We agree that tokenizer effects are an important consideration. Our word-order shuffling experiments hold tokenization fixed while disrupting linguistic structure and produce only limited changes to LAPE units, indicating that the romanization effect is not reducible to token-boundary shifts alone. We also compare romanized inputs against English (same script family, different language) and observe disjoint units. In the revision we add an explicit tokenizer-matched baseline (using a fixed BPE vocabulary across scripts where feasible) and a transliteration control that minimizes token-boundary changes. These additions reinforce that orthography shapes unit identity beyond tokenizer artifacts. revision: yes

  2. Referee: [§3.2 and §5] §3.2 and §5 (LAPE metric and SAE decompositions): The assumption that LAPE and SAE features cleanly isolate orthographic effects from analysis artifacts is load-bearing for the claim of script-over-linguistic organization. The manuscript provides limited validation (e.g., no hyperparameter ablations for SAEs or checks that LAPE units are not proxying token identity), so it remains possible that the observed conditioning reflects tokenizer behavior rather than model-internal script encoding.

    Authors: We acknowledge the need for stronger validation. LAPE is defined on activation probabilities across languages and therefore does not directly encode token identity; however, we will add explicit checks showing that the same LAPE units remain stable under different tokenizations of the same script. For SAEs we will include hyperparameter ablations (dictionary size, sparsity coefficient) and report reconstruction fidelity across settings. These additions will be placed in §3.2 and §5 to demonstrate that the observed script conditioning is not an artifact of the analysis pipeline. revision: yes

  3. Referee: [§6] §6 (Causal interventions): The finding that generation is most sensitive to surface-invariant units (rather than typologically aligned ones) requires more detail on intervention implementation, baseline comparisons, and statistical tests. Without these, it is unclear whether the differential sensitivity securely supports the gradual-emergence conclusion over alternative explanations.

    Authors: We agree that additional methodological detail is warranted. In the revised §6 we expand the intervention protocol to specify exactly how units are masked or activated (including the precise activation threshold and layer range), add baseline comparisons (random unit interventions and typological-unit controls), and report statistical tests (paired t-tests with effect sizes and p-values) on the generation-sensitivity differences. These clarifications will better isolate the contribution of surface-invariant units and support the gradual-emergence interpretation. revision: yes
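The baseline and statistical comparisons promised in the third response can be illustrated with a small sketch: drawing a size-matched random set of units as a control, then computing a paired t statistic and a paired effect size over per-prompt scores. All names and numbers here are hypothetical and invented for illustration; they are not the authors' protocol or results.

```python
import math
import random


def random_unit_baseline(n_units, k, seed=0):
    """Pick k distinct unit indices at random as a size-matched control set."""
    return random.Random(seed).sample(range(n_units), k)


def paired_t(a, b):
    """Paired t statistic, Cohen's d_z, and degrees of freedom for a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1))
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean / (sd / math.sqrt(n))
    d_z = mean / sd  # effect size for paired designs
    return t_stat, d_z, n - 1


# Made-up per-prompt loss increases after ablating two unit sets.
targeted = [0.9, 1.1, 1.0, 1.2]   # e.g. surface-invariant units (hypothetical)
control = [0.2, 0.3, 0.1, 0.4]    # e.g. random units of the same count

control_units = random_unit_baseline(n_units=2048, k=16, seed=0)
t_stat, d_z, df = paired_t(targeted, control)
```

A p-value would then be read from the t distribution with `df` degrees of freedom, for example with a statistics library; that step is left out here to keep the sketch dependency-free.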

Circularity Check

0 steps flagged

No circularity: empirical measurements and interventions form an independent chain


The paper defines LAPE as a metric on activation probabilities, applies it to locate language-associated units, then measures the effects of explicit perturbations (romanization, word-order shuffle) and SAE decompositions on those units. Probing and causal interventions are performed on the resulting representations. None of these steps reduce a claimed result to a fitted parameter or self-referential definition; the conclusions follow from observed differences between conditions rather than from any equation or prior self-citation that encodes the target finding by construction. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. Standard assumptions of interpretability methods (LAPE captures language-specific activation, SAEs recover meaningful features) are implicit.

axioms (2)
  • domain assumption LAPE metric accurately reflects language-specific unit activation without metric-specific artifacts
    Used to conclude orthographic conditioning
  • domain assumption Sparse autoencoders decompose activations into interpretable units
    Basis for further analysis of unit identity

pith-pipeline@v0.9.0 · 5484 in / 1232 out tokens · 27350 ms · 2026-05-10T18:58:06.389870+00:00 · methodology



Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

  1. [1]

    Sparse autoencoders can capture language-specific concepts across diverse languages, 2025

    Sparse autoencoders can capture language-specific concepts across diverse languages. Preprint, arXiv:2507.11230. Mikel Artetxe, Sebastian Ruder, and Dani Yogatama

  2. [2]

    In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online

    On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics. David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. 2017. Network dissection: Quantifying interpretabilit...

  3. [3]

    Albert Costa and Núria Sebastián-Gallés

    Identifying bilingual semantic neural representations across languages. Brain Lang, 120(3):282–289. Albert Costa and Núria Sebastián-Gallés. 2014. How does the bilingual experience sculpt the brain? Nat Rev Neurosci, 15(5):336–345. D. Crystal. 2003. English as a Global Language. Canto (Cambridge University Press). Cambridge University Press. Boyi Deng, Yu...

  4. [4]

    The Llama 3 Herd of Models

    The llama 3 herd of models. Preprint, arXiv:2407.21783. Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef Van Genabith, and Simon Ostermann. 2025. Language arithmetics: Towards systematic language neuron identification and manipulation. In Proceedings of the 14th International Joint Conference on Natural Language Processing a...

  5. [5]

    In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1663–1674, Online

    On the language neutrality of pre-trained multilingual representations. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1663–1674, Online. Association for Computational Linguistics. Jindřich Libovický, Rudolf Rosa, and Alexander Fraser

  6. [6]

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, and Neel Nanda

    How language-neutral is multilingual bert? Preprint, arXiv:1911.03310. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analy...

  7. [7]

    Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller

    Understanding Language. Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. 2025. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations. Jérôme Michaud. 2024. A complex systems perspective on language ev...

  8. [8]

    Telmo Pires, Eva Schlinger, and Dan Garrette

    Scaling neural machine translation to 200 languages. Nature, 630(8018):841–846. Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics. Inaya Rahma...

  9. [9]

    Unveiling the influence of amplifying language-specific neurons. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 919–968, Mumbai, India. The Asian Federation of Natural Language Processing and The Associatio...

  10. [10]

    In Socially Responsible Language Modelling Research

    Low-resource languages jailbreak GPT-4. In Socially Responsible Language Modelling Research. Appendix Contents. Below we provide an overview of the appendix. These sections are intended to support the core claims by providing methodological details and extended scaling results. • Appendix A: Frequently Asked Questions (FAQs). Addresses common questions rega...

  11. [11]

    Do language-associated units imply the existence of a universal interlingua? No. While language-associated units are clearly identifiable and can influence model behavior, our results show that they are predominantly sensitive to surface-form cues such as script and token distribution

  12. [12]

    Is the observed script sensitivity simply an artifact of tokenization? Tokenization necessarily introduces distinct input embeddings across scripts, but our analysis goes beyond early-layer effects. We observe that alignment remains low even in intermediate layers, indicating that script sensitivity is not merely a tokenizer artifact but reflects persi...

  13. [13]

    Why use 1B and 2B models for the main exposition? We center our primary exposition on Llama-3.2-1B and Gemma-2-2B to enable extensive, computationally intensive representational sweeps and causal interventions across many layers and languages. However, to ensure our findings are not artifacts of limited capacity, we explicitly validate our core exper...

  14. [14]

    As detailed in our scaling analysis, we validate our findings on Llama-3-8B and Gemma-2-9B

    Do these findings generalize to larger models? Yes. As detailed in our scaling analysis, we validate our findings on Llama-3-8B and Gemma-2-9B. We observe that representational fragmentation under script variation, as well as robustness under structural perturbation, persist at these larger scales. Crucially, this fragmentation remains even though th...

  15. [15]

    Does strong probing performance imply functional importance? No. Probing reveals that typological properties become increasingly linearly accessible in deeper layers, but causal interventions show that functional importance aligns with invariance to surface perturbations. This reinforces the view that linear decodability does not imply causal control

  16. [16]

    Analyzing both allows us to separate functional relevance from interpretability and avoid over-attributing abstract meaning to sparse features alone

    Why analyze both raw neurons and SAE features? Raw neurons directly govern model behavior, while SAE features provide an interpretable decomposition of these activations. Analyzing both allows us to separate functional relevance from interpretability and avoid over-attributing abstract meaning to sparse features alone

  17. [17]

    यह एक बहुत ही простой , सस्ती , और ﻗﺎﺑﻞ اﻋﺘﻤﺎد اﺑﺖ कार है ,

    What is the main takeaway for interpreting language-associated neurons? Language-associated units exist and matter, but they primarily reflect surface-form processing rather than abstract language identity. B Extended Related Work. B.1 Language-Associated Units and Multilingual Representations. Understanding how multilingual LMs encode language identit...