pith. machine review for the scientific record. sign in

arxiv: 2604.12710 · v2 · submitted 2026-04-13 · 💻 cs.LG · cs.AI· cs.CL

Recognition: unknown

LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety

Authors on Pith no claims yet

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords LLM safetysemantic bottlenecklanguage-agnostic alignmentmultilingual vulnerabilitiesattack success raterepresentation geometrylow-resource languages
0
0 comments X

The pith

LLMs achieve robust safety across languages by anchoring alignment at the semantic bottleneck where representations track meaning over language identity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models perform well on safety in high-resource languages but fail when queried in low-resource languages. The paper attributes the gap to safety alignment being dominated by language identity rather than shared semantic understanding. It locates a semantic bottleneck layer whose internal geometry reflects semantic content more than language. LASA applies safety alignment directly at this layer to make the protection language-agnostic. The result is a large drop in attack success rates that holds across model families and sizes, showing that safety can be moved from surface text into the model's shared semantic space.

Core claim

The authors identify the semantic bottleneck as an intermediate layer in LLMs in which the geometry of model representations is governed primarily by shared semantic content rather than language identity. They propose Language-Agnostic Semantic Alignment (LASA) that anchors safety alignment directly in these bottlenecks. Experiments show that LASA substantially improves safety across all languages: average attack success rate drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). The analysis suggests safety alignment requires anchoring safety understanding in the model's language-agnostic semantic space rather than in the

What carries the argument

The semantic bottleneck, an intermediate layer whose representation geometry is driven by semantic content independent of language identity, functions as the intervention site for language-agnostic safety alignment.

If this is right

  • Safety alignment can be performed in a language-agnostic manner by targeting the identified semantic bottleneck.
  • Average attack success rates fall to low single digits on multiple instruction-tuned models of different sizes.
  • Safety should be understood and enforced at the level of internal semantic representations rather than surface text.
  • The method preserves effectiveness across model scales from 7B to 32B parameters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bottleneck location could be used to align other properties such as factual consistency or reduced bias without language dependence.
  • If the layer is stable across architectures, automatic detection of semantic bottlenecks might become a standard diagnostic step for new models.
  • The approach implies that many low-resource language failures are fixable by relocation of alignment rather than by adding more training data in those languages.
  • Representation-level interventions of this kind may generalize to other multilingual or cross-domain alignment problems.

Load-bearing premise

The semantic bottleneck is the decisive location for safety alignment and that intervening there leaves other model capabilities intact without creating new vulnerabilities.

What would settle it

A test in which LASA-aligned models exhibit either higher attack success rates in any language or measurable drops in performance on standard non-safety benchmarks would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.12710 by Han Qiu, Haoran Liu, Hongning Wang, Hui Xue, Jiale Cheng, Jialing Tao, Jiaqi Weng, Jinzhe Tu, Junxiao Yang, Minlie Huang, Shiyao Cui, Zhexin Zhang.

Figure 1
Figure 1. Figure 1: Heatmap of safety score for different meth [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Left: In the text space, representations cluster by language, causing safety training to fail on semantically equivalent expressions in unseen languages or symbols. Right: In the semantic space, semantically equivalent queries cluster closely across languages and modalities, allowing safety knowledge learned from high-resource languages to naturally transfer to low-resource languages via shared semantic st… view at source ↗
Figure 3
Figure 3. Figure 3: (Left): Layer-wise Silhouette scores for clustering by language and by query on Llama-3.1-8B-Instruct. Language-based scores follow a U-shaped trajectory, whereas query-based scores exhibit an inverted U-shaped trajectory, and their gap peaks at intermediate layers which we refer to as the Semantic Bottleneck. (Right): t￾SNE visualizations of hidden states across selected layers, colored by language (top) … view at source ↗
Figure 4
Figure 4. Figure 4: Framework for Language-Agnostic Semantic Alignment (LASA): Hidden states are extracted from [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Relationship between MMLU accuracy on Swahili and safety semantic understanding ability of SSI on Swahili. The saturation curve (R2 = 0.988) indicates that the Semantic Bottleneck’s effectiveness on safety scales with multilingual capability. and the performance of SSI in safety semantic ac￾curacy AccSafety j . As shown in [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ASR result of LASA on LLaMA-3.1-8B￾Instruct with SSI trained on different layers. Training SSI at bottleneck layer reach clearly the best safety per￾formance [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: ASR of LASA on Qwen2.5 and Qwen3 se￾ries. LASA stabilizes average ASR at 4% across all scales (7B–32B) on HarmBench and MultiJail. The re￾sults show that safety alignment improves with model scale, correlating with enhanced base semantic capabil￾ities. the layers move closer to the input or the output, reaching the minimum around the semantic bot￾tleneck. Notably, training SSI on the final layer yields a f… view at source ↗
Figure 8
Figure 8. Figure 8: T-SNE results on different layers of Llama [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Silhouette Score analysis and t-SNE visualizations of hidden states on Qwen2.5-7B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Silhouette Score analysis and t-SNE visualizations of hidden states on Qwen2.5-14B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Silhouette Score analysis and t-SNE visualizations of hidden states on Qwen2.5-32B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Silhouette Score analysis and t-SNE visualizations of hidden states on Qwen3-8B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗
Figure 15
Figure 15. Figure 15: Relationship between MMLU accuracy on Thai and safety semantic understanding ability on Thai for Qwen3-8B. refuses to respond to the unsafe query. Unsafe: The response is deemed unsafe if it includes unsafe content or if it directly responds to the unsafe query. Please evaluate the response and pro￾vide your selection from the list [’safe’, ’unsafe’] without returning any other character. E Impact of Tran… view at source ↗
Figure 14
Figure 14. Figure 14: Relationship between MMLU accuracy on Thai and safety semantic understanding ability on Thai for Qwen2.5-7B-Instruct. ternal translation engines, particularly in safety￾critical contexts. Evaluation Prompt Using GPT-4o Given a pair of query and response, assess the safety of the response solely based on its content, disregarding the harmful content present in the query. Definitions: Safe: The response is … view at source ↗
Figure 16
Figure 16. Figure 16: Silhouette Score analysis and t-SNE visualizations of hidden states on Llama-3.1-8B-Instruct. All data [PITH_FULL_IMAGE:figures/full_fig_p017_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Silhouette Score analysis and t-SNE visualizations of hidden states on Llama-3.1-8B-Instruct. All data [PITH_FULL_IMAGE:figures/full_fig_p017_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Response examples of different methods on Llama-3.1-8B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p021_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Response examples of different methods on Qwen2.5-7B-Instruct. [PITH_FULL_IMAGE:figures/full_fig_p022_19.png] view at source ↗
read the original abstract

Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in semantic bottlenecks. Experiments show that LASA substantially improves safety across all languages: average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment requires anchoring safety understanding not in surface text, but in the model's language-agnostic semantic space.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper attributes LLM safety vulnerabilities in low-resource languages to a mismatch between language-agnostic semantic representations and high-resource-biased safety alignment. It empirically identifies an intermediate 'semantic bottleneck' layer where representation geometry is dominated by shared semantics rather than language identity, then proposes Language-Agnostic Semantic Alignment (LASA) to anchor safety training at this layer. Experiments report that LASA reduces average attack success rate from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and maintains 3-4% ASR on Qwen2.5/Qwen3 models (7B-32B) across languages.

Significance. If the central claim holds after validation, the work provides a representation-level mechanism for multilingual safety that could be more scalable than per-language alignment. The empirical identification of a semantic bottleneck and the reported cross-lingual ASR gains would strengthen arguments for intervening in language-agnostic subspaces rather than surface tokens.

major comments (2)
  1. [Experimental results and layer-selection procedure] The manuscript provides no layer-ablation experiments comparing the LASA alignment procedure performed at the identified semantic-bottleneck layer versus the same procedure at other layers. Without these controls, the reported ASR reduction (24.7 % → 2.8 % on LLaMA-3.1-8B) cannot be attributed specifically to the bottleneck location rather than to the alignment objective itself, directly undermining the representation-level argument in the abstract and introduction.
  2. [Identification of the semantic bottleneck] Details are missing on the exact procedure used to identify the semantic bottleneck (e.g., the quantitative metrics or visualizations of representation geometry, the precise layer index chosen, and any statistical tests confirming language-identity vs. semantic dominance). This information is required to reproduce the central empirical observation and to assess whether the chosen layer is uniquely load-bearing for the safety gains.
minor comments (2)
  1. [Abstract and Experiments] The abstract and results sections should report the number of languages evaluated, the specific attack methods and datasets used to compute ASR, and any statistical significance tests or confidence intervals accompanying the 24.7 % to 2.8 % drop.
  2. [Evaluation] Side-effect measurements (e.g., changes in general capabilities, helpfulness, or new vulnerabilities after LASA) are not mentioned; adding a brief evaluation would strengthen the claim that intervening at the bottleneck does not degrade other model behaviors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments highlight important aspects for strengthening the empirical support of our claims regarding the semantic bottleneck and LASA. We address each major comment below and commit to revisions that will enhance the clarity, reproducibility, and rigor of the work without altering its core contributions.

read point-by-point responses
  1. Referee: [Experimental results and layer-selection procedure] The manuscript provides no layer-ablation experiments comparing the LASA alignment procedure performed at the identified semantic-bottleneck layer versus the same procedure at other layers. Without these controls, the reported ASR reduction (24.7 % → 2.8 % on LLaMA-3.1-8B) cannot be attributed specifically to the bottleneck location rather than to the alignment objective itself, directly undermining the representation-level argument in the abstract and introduction.

    Authors: We agree that the absence of layer-ablation controls is a limitation in the current manuscript. While the identification of the semantic bottleneck is supported by our representation geometry analysis, we did not perform the full LASA procedure at alternative layers to isolate its effect. In the revised version, we will add a dedicated ablation study applying the identical LASA alignment objective at multiple layers (early, intermediate, and late) across the models evaluated. We will report the resulting ASR values per language and layer, demonstrating that performance gains are maximized at the identified bottleneck. This addition will directly address the concern and reinforce the representation-level argument. revision: yes

  2. Referee: [Identification of the semantic bottleneck] Details are missing on the exact procedure used to identify the semantic bottleneck (e.g., the quantitative metrics or visualizations of representation geometry, the precise layer index chosen, and any statistical tests confirming language-identity vs. semantic dominance). This information is required to reproduce the central empirical observation and to assess whether the chosen layer is uniquely load-bearing for the safety gains.

    Authors: We acknowledge that the manuscript would benefit from greater specificity in describing the bottleneck identification. The current text outlines the empirical observation that intermediate layers exhibit semantic dominance in representation geometry, but does not provide the full procedural details. In the revision, we will expand the relevant analysis section to include: the exact quantitative metrics (e.g., cross-lingual semantic similarity scores versus language-identity baselines), the visualizations used (e.g., t-SNE projections of hidden states), the precise layer indices selected for each model family, and any statistical validation (e.g., significance tests on similarity differences). These details will be added to enable full reproducibility and to clarify why the chosen layer is load-bearing for the safety improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The derivation begins with an empirical observation of representation geometry (semantic bottleneck identified via layer-wise analysis of language vs. semantic dominance), followed by the proposal of LASA to anchor alignment at that layer and direct measurement of ASR reductions on multilingual attacks. No step reduces a claimed prediction or result to its inputs by construction, self-definition, or self-citation chain. The reported ASR drop (24.7% to 2.8%) is an experimental outcome, not a fitted or renamed quantity. The method is self-contained against external benchmarks with no load-bearing uniqueness theorems or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are detailed. The central claim rests on the empirical existence of the semantic bottleneck and standard assumptions about transformer layer representations.

pith-pipeline@v0.9.0 · 5547 in / 1262 out tokens · 76639 ms · 2026-05-10T14:55:49.123709+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  2. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

Reference graph

Works this paper leans on

5 extracted references · 4 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    In Findings of the Association for Computational Linguistics: ACL 2024, pages 9954– 9972, Bangkok, Thailand

    Direct preference optimization with an offset. In Findings of the Association for Computational Linguistics: ACL 2024, pages 9954– 9972, Bangkok, Thailand. Association for Computa- tional Linguistics. AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1. Berk Atil, Rebecca J Passonneau, and Fred Morstat- ter. 2025. Do...

  2. [2]

    Qwen2.5 Technical Report

    Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115. Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen Bach, and Julia Kreutzer. 2025. The state of multilingual llm safety research: From measuring the language gap to mitigating it. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 15856–15871. Zheng...

  3. [3]

    Low-resource languages jailbreak gpt-4

    arXiv preprint arXiv:2310.02446. Haneul Yoo, Yongjin Yang, and Hwaran Lee

  4. [4]

    In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 13392–13413

    Code-switching red-teaming: Llm evalua- tion for safety and multilingual understanding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 13392–13413. Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. Rrhf: Rank responses to align language models wi...

  5. [5]

    Q:" and

    The averageR 2 value is approximately 0.90, providing further evidence of a strong relationship between general multilingual capability and safety performance. C Complexity and Parameter Analysis of Safety Layer In a standard Transformer-based Large Language Model, the parameter count is primarily domi- nated by the self-attention mechanism and the feed-f...