SLASH the Sink: Sharpening Structural Attention Inside LLMs
Pith reviewed 2026-05-20 22:15 UTC · model grok-4.3
The pith
LLMs spontaneously reconstruct graph topology inside their attention maps as a sawtooth pattern matching the token adjacency matrix, but the attention sink dilutes this ability through conflict with anisotropic bias; a plug-and-play fix rev
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct sawtooth pattern in their attention maps that structurally aligns with the token-level adjacency matrix. This intrinsic structural understanding is diluted by the attention sink, which the authors formalize as a representation bottleneck stemming from a fundamental conflict between the model's anisotropic bias, essential for language tasks, and the topology-aware local aggregation required for graph reasoning. To address this, they propose SLASH, a training-free plug-and-play attention redistribution that amplifies the internal structural understanding and yields significant performance gains on pure grap
What carries the argument
SLASH (StructuraL Attention SHarpening), a plug-and-play attention redistribution technique that amplifies the spontaneous sawtooth alignment with graph topology to overcome dilution from the attention sink.
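The summary above does not spell out SLASH's update rule, so the following is only a minimal sketch of what an attention redistribution of this kind could look like. It assumes the sink sits at token index 0 and that a boolean mask of structurally relevant positions is already available. The function name `redistribute_row` and the parameters `alpha`, `struct_mask`, and `sink_index` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

def redistribute_row(attn: Sequence[float], struct_mask: Sequence[bool],
                     alpha: float = 0.5, sink_index: int = 0) -> List[float]:
    """Move a fraction `alpha` of the sink's attention mass onto structural positions.

    Hypothetical rule for illustration only; not the paper's SLASH update.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    out = list(attn)
    targets = [j for j, m in enumerate(struct_mask) if m and j != sink_index]
    if not targets:
        return out  # nothing to sharpen in this row
    moved = alpha * out[sink_index]
    out[sink_index] -= moved
    weight = sum(out[j] for j in targets)
    for j in targets:
        # Split proportionally to existing attention; uniform if targets hold no mass.
        share = out[j] / weight if weight > 0 else 1.0 / len(targets)
        out[j] += moved * share
    return out

row = [0.70, 0.05, 0.15, 0.10]      # the sink at position 0 holds most of the mass
mask = [False, False, True, True]   # positions 2 and 3 are treated as structural
sharpened = redistribute_row(row, mask, alpha=0.5)
```

The row still sums to one after the move, which is the property that lets a rule like this be applied at inference time without retraining.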
Load-bearing premise
The sawtooth pattern in attention maps represents genuine internal reconstruction of graph topology, and its dilution stems specifically from a representation bottleneck caused by conflict between anisotropic bias and topology-aware local aggregation.
What would settle it
Showing that attention maps lack the described sawtooth alignment with the token-level adjacency matrix, or that applying the SLASH redistribution produces no measurable improvement on graph reasoning tasks.
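One way to make the first half of this test concrete is to score how much non-sink attention mass lands on adjacency entries. The sketch below is an assumed metric, not the paper's: `alignment_score` is a hypothetical name, the sink is assumed to be column 0, and the toy matrices are invented for illustration.

```python
from typing import Sequence

def alignment_score(attn: Sequence[Sequence[float]],
                    adj: Sequence[Sequence[int]],
                    sink_index: int = 0) -> float:
    """Share of non-sink attention mass that falls on token-level adjacency entries."""
    on_edge = 0.0
    total = 0.0
    for i, row in enumerate(attn):
        for j, a in enumerate(row):
            if j == sink_index:
                continue  # skip the sink column so it cannot mask the signal
            total += a
            if adj[i][j]:
                on_edge += a
    return on_edge / total if total > 0 else 0.0

# Toy causal attention map (rows sum to 1) and a toy adjacency matrix.
attn = [
    [1.0, 0.0, 0.0],
    [0.6, 0.4, 0.0],
    [0.5, 0.4, 0.1],
]
adj = [
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
]
score = alignment_score(attn, adj)
```

A score no higher than the one obtained with a shuffled adjacency matrix would count against the sawtooth claim; a clearly higher score would be consistent with it but would not by itself rule out serialization artifacts.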
Original abstract
Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or fine-tuning, which incur high costs and lost generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (SLASH), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate that SLASH delivers significant and consistent performance gains across diverse LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs spontaneously reconstruct graph topology when processing serialized inputs, as shown by a distinct 'sawtooth' pattern in attention maps that aligns with the token-level adjacency matrix. This intrinsic capability is diluted by the attention sink due to a conflict between the model's anisotropic bias (useful for language) and the need for topology-aware local aggregation. The authors formalize this as a representation bottleneck and propose SLASH, a training-free, plug-and-play attention redistribution technique to sharpen the structural signal. Experiments on pure graph tasks and molecular property prediction are reported to show consistent performance gains across diverse LLMs without fine-tuning or external adapters.
Significance. If the mechanistic interpretation of the sawtooth pattern as genuine topology reconstruction holds and SLASH's gains are robustly attributable to it, the work would be significant for enabling efficient, training-free improvements to LLMs' structural reasoning on graphs and molecules. This addresses a known limitation without the computational cost of adapters or fine-tuning, and the plug-and-play nature could have broad applicability. The absence of parameter-free derivations or machine-checked proofs limits the theoretical strength, but reproducible attention redistribution would be a practical contribution if validated.
major comments (3)
- [§3] §3 (or equivalent section on internal mechanisms): The central claim that the sawtooth pattern constitutes spontaneous reconstruction of graph topology (aligning with the token-level adjacency matrix) lacks controls to distinguish it from serialization artifacts. For example, no comparison is shown with node-order permutations, alternative linearizations, or non-graph sequences that preserve token positions; without these, the dilution-by-sink story and motivation for SLASH rest on an untested interpretation.
- [Experiments] Experiments section (likely §5): The reported performance gains on graph tasks and molecular prediction provide no details on statistical tests, variance across runs, or ablation of the redistribution parameters in SLASH. This makes it impossible to assess whether the gains are significant, consistent, or specifically due to amplifying the sawtooth rather than a generic attention adjustment.
- [§4] Theoretical formalization (likely §4): The representation bottleneck is described as arising from conflict between anisotropic bias and local aggregation, but no equation or derivation quantifies how the attention sink specifically suppresses the topology signal; the account remains qualitative and does not reduce to a testable model that predicts the observed sawtooth dilution.
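The controls requested in the first major comment can be generated mechanically. The sketch below builds BFS and DFS linearizations of the same graph and a random node relabeling. All helper names and the adjacency-list text format are assumptions made for illustration; the paper's serialization format is not given here, and the matched-position non-graph control is not covered.

```python
import random
from collections import deque
from typing import Dict, List

Graph = Dict[int, List[int]]

def bfs_order(g: Graph, start: int) -> List[int]:
    seen, order, queue = {start}, [], deque([start])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(g[u]):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order

def dfs_order(g: Graph, start: int) -> List[int]:
    seen, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        # Push in reverse so the smallest neighbour is visited first.
        stack.extend(v for v in sorted(g[u], reverse=True) if v not in seen)
    return order

def serialize(g: Graph, order: List[int]) -> str:
    """Write one adjacency list per node, in the given visiting order."""
    return "; ".join(f"{u}: {' '.join(str(v) for v in sorted(g[u]))}" for u in order)

def permute_labels(g: Graph, seed: int) -> Graph:
    """Relabel nodes at random: the topology is unchanged, the surface tokens are not."""
    nodes = sorted(g)
    shuffled = nodes[:]
    random.Random(seed).shuffle(shuffled)
    m = dict(zip(nodes, shuffled))
    return {m[u]: [m[v] for v in g[u]] for u in g}

g = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
bfs_text = serialize(g, bfs_order(g, 0))
dfs_text = serialize(g, dfs_order(g, 0))
relabeled = permute_labels(g, seed=0)
```

If the sawtooth tracks topology rather than token position, it should follow the adjacency matrix under each of these variants instead of staying fixed in place.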
minor comments (2)
- [Methods] The definition and construction of the 'token-level adjacency matrix' should include a small illustrative example in the main text or a dedicated figure to clarify how it is derived from the serialized graph input.
- [Figures] Figure captions for attention maps should explicitly state the graph size, serialization order, and layer/head indices shown, to allow readers to reproduce the sawtooth observation.
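For the first minor comment, a small example of the kind requested might look like the sketch below. The construction is an assumption: an entry is set when two tokens mention adjacent nodes and the second token comes earlier in the sequence (causal attention). The paper's actual definition of the token-level adjacency matrix may differ, and `token_adjacency` is a hypothetical name.

```python
from typing import List, Optional, Set, Tuple

def token_adjacency(token_nodes: List[Optional[int]],
                    edges: Set[Tuple[int, int]]) -> List[List[int]]:
    """A[i][j] = 1 when tokens i and j (with j < i) mention adjacent nodes.

    `token_nodes[i]` is the node a token mentions, or None for separators.
    Assumed construction for illustration; the paper's definition may differ.
    """
    undirected = edges | {(v, u) for u, v in edges}
    n = len(token_nodes)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # causal: only earlier tokens are visible
            a, b = token_nodes[i], token_nodes[j]
            if a is not None and b is not None and (a, b) in undirected:
                A[i][j] = 1
    return A

# Path graph 0-1-2 serialized as the tokens: "0", "-", "1", ";", "1", "-", "2"
tokens = [0, None, 1, None, 1, None, 2]
A = token_adjacency(tokens, {(0, 1), (1, 2)})
```

Because a node can be mentioned several times in the serialized text, one graph edge maps to several matrix entries, which is why the matrix is defined over tokens rather than nodes.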
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the work.
Point-by-point responses
- Referee: [§3] §3 (or equivalent section on internal mechanisms): The central claim that the sawtooth pattern constitutes spontaneous reconstruction of graph topology (aligning with the token-level adjacency matrix) lacks controls to distinguish it from serialization artifacts. For example, no comparison is shown with node-order permutations, alternative linearizations, or non-graph sequences that preserve token positions; without these, the dilution-by-sink story and motivation for SLASH rest on an untested interpretation.
Authors: We agree that the current presentation would benefit from explicit controls to isolate topology reconstruction from potential serialization effects. While the sawtooth pattern is shown to align quantitatively with the token-level adjacency matrix across multiple datasets and models, we will add a dedicated subsection with new experiments: attention maps under node-order permutations, alternative linearizations (e.g., BFS versus DFS traversals), and non-graph sequences with matched token positions. These will be used to test whether the pattern persists specifically for graph-structured inputs. revision: yes
- Referee: [Experiments] Experiments section (likely §5): The reported performance gains on graph tasks and molecular prediction provide no details on statistical tests, variance across runs, or ablation of the redistribution parameters in SLASH. This makes it impossible to assess whether the gains are significant, consistent, or specifically due to amplifying the sawtooth rather than a generic attention adjustment.
Authors: We acknowledge that the experiments section lacks reported variance, statistical tests, and parameter ablations. In the revised manuscript we will report mean results with standard deviations over at least five random seeds, include paired statistical tests (e.g., Wilcoxon signed-rank) to establish significance of gains, and add ablations on SLASH hyperparameters such as redistribution strength and sink suppression ratio. These will demonstrate that performance improvements track the amplification of the structural attention signal rather than generic redistribution. revision: yes
- Referee: [§4] Theoretical formalization (likely §4): The representation bottleneck is described as arising from conflict between anisotropic bias and local aggregation, but no equation or derivation quantifies how the attention sink specifically suppresses the topology signal; the account remains qualitative and does not reduce to a testable model that predicts the observed sawtooth dilution.
Authors: The current Section 4 offers a conceptual formalization of the representation bottleneck arising from the tension between anisotropic language bias and topology-aware aggregation. To make this more quantitative, we will introduce a simplified mathematical model in the revision that expresses the dilution of the sawtooth pattern as a function of sink-induced attention mass and the model's anisotropy parameter, including a short derivation showing how this predicts reduced local topology signal. This will render the account more directly testable against the observed attention maps. revision: yes
Circularity Check
No circularity: claims rest on empirical attention patterns and external performance validation
Full rationale
The paper reports an observed 'sawtooth' pattern in attention maps as evidence of spontaneous topology reconstruction, then proposes a training-free redistribution method (SLASH) that yields measured gains on graph and molecular tasks. No equations or derivations are shown that reduce a claimed prediction back to fitted inputs or self-citations by construction. The theoretical framing of a 'representation bottleneck' is presented as post-hoc explanation of the observed dilution rather than a closed mathematical loop. Self-citations, if present, are not load-bearing for the core empirical finding or the plug-and-play intervention.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs possess an anisotropic bias that is essential for language tasks but suppresses topology-aware local aggregation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: LLMs spontaneously reconstruct the graph’s topology internally, evidenced by a distinct 'sawtooth' pattern in their attention maps that structurally aligns with the 'token-level adjacency matrix'
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Theorem 4.1 (Geometric Contraction) … $\lVert h_k - h_l \rVert = (1-\lambda)\,\lVert h^{\mathrm{topo}}_k - h^{\mathrm{topo}}_l \rVert$
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_high_calibrated_iff · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Proposition 4.2 (Dirichlet Energy Decay) … $E_{\mathrm{Dir}}(H) \approx (1-\lambda)^2\,E_{\mathrm{Dir}}(H^{\mathrm{topo}})$
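The two quoted statements are mutually consistent under a simple mixing model, sketched here as a worked derivation. The model is an assumption, since the paper's hypotheses are elided in the excerpts: each representation $h_k$ is taken to be a convex mix of a topology-aware representation $h^{\mathrm{topo}}_k$ and a shared sink vector $s$ with sink weight $\lambda$, and the Dirichlet energy is taken to be the sum of squared differences over the edge set $\mathcal{E}$.

```latex
\begin{align*}
h_k &= (1-\lambda)\,h^{\mathrm{topo}}_k + \lambda\,s, \qquad \lambda \in [0,1] \\
h_k - h_l &= (1-\lambda)\,\bigl(h^{\mathrm{topo}}_k - h^{\mathrm{topo}}_l\bigr) \\
\lVert h_k - h_l \rVert &= (1-\lambda)\,\lVert h^{\mathrm{topo}}_k - h^{\mathrm{topo}}_l \rVert \\
E_{\mathrm{Dir}}(H) &= \sum_{(k,l)\in\mathcal{E}} \lVert h_k - h_l \rVert^2
  = (1-\lambda)^2\,E_{\mathrm{Dir}}(H^{\mathrm{topo}})
\end{align*}
```

The shared term $\lambda s$ cancels in every pairwise difference, so a larger sink weight shrinks all distances by the same factor and the energy by its square. In this idealised model the energy relation is an equality; the excerpt states it with $\approx$, so the paper's setting is presumably less restrictive than this sketch.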
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.