Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

Benjamin Piwowarski; Camille Barboule; Pierre-Antoine Lequeu

arxiv: 2605.30022 · v1 · pith:Q4VVHOCZnew · submitted 2026-05-28 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-29 07:22 UTC · model grok-4.3

keywords: positional encoding · disentanglement · transformer encoders · masked language modeling · linguistic probing · absolute positional · relative positional · attention specialization

The pith

Disentangling semantic and positional streams in Transformers preserves positional encodings and improves probing results on 49 of 65 linguistic phenomena.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper modifies a Transformer encoder to process semantic, absolute positional, and relative positional information in three separate streams while confining the masked language modeling objective to the semantic stream only. This decoupling causes the isolated absolute positional subspace to collapse into a low-frequency two-dimensional manifold that captures document structure. Attention heads divide into structure-oriented and semantic-oriented groups, and the disentangled model retains positional information more robustly than entangled baselines such as RoPE. A sympathetic reader would care because clearer separation of order signals could support more reliable long-context and retrieval behavior.

Core claim

The model processes semantic, absolute positional (AP), and relative positional (RP) signals in explicitly disentangled streams and restricts the MLM objective to the semantic stream. Under this setup, the isolated AP subspace collapses into a low-frequency two-dimensional manifold that captures the structure of the document, and attention heads specialize into structure-oriented and semantic-oriented groups, with RP supporting the latter. The disentangled approach also preserves positional encoding better than standard methods, improving linguistic representation on 49 of the 65 phenomena of the Flash-Holmes probing benchmark.

What carries the argument

Three explicitly disentangled streams (semantic, absolute positional, relative positional) in an encoder Transformer with the MLM objective confined to the semantic stream.

If this is right

The isolated absolute positional subspace spontaneously collapses into a low-frequency two-dimensional manifold capturing document structure.
Attention heads specialize into structure-oriented and semantic-oriented groups, with relative positional encodings supporting semantic processing.
Standard positional encodings such as RoPE and RP only weakly encode macroscopic structure, while entangled absolute positional encodings lose it in final layers under MLM pressure.
The disentangled approach improves performance on 49 of 65 linguistic phenomena in the Flash-Holmes probing benchmark.
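
The first point in this list, that the AP subspace is effectively two-dimensional, is the kind of claim that can be checked from the singular values of the centered AP hidden-state matrix (Figure 6 reports a cumulative explained variance of this sort). The following is a minimal sketch, assuming the singular values have already been computed; the function name and the two example spectra are hypothetical and are not values from the paper.

```python
from itertools import accumulate


def cumulative_explained_variance(singular_values):
    """Cumulative fraction of variance captured by the leading components.

    The variance along component i is proportional to singular_values[i] ** 2.
    """
    squared = [s * s for s in sorted(singular_values, reverse=True)]
    total = sum(squared)
    if total == 0:
        raise ValueError("all singular values are zero")
    return [c / total for c in accumulate(squared)]


# Hypothetical spectra, for illustration only (not taken from the paper).
collapsed = [10.0, 8.0, 0.5, 0.3, 0.1]  # two dominant directions
spread = [3.0, 3.0, 3.0, 3.0, 3.0]      # variance spread over all directions

# Share of variance held by the first two components in each case.
top2_collapsed = cumulative_explained_variance(collapsed)[1]
top2_spread = cumulative_explained_variance(spread)[1]
```

A spectrum that has collapsed onto two directions puts nearly all of its variance in the first two components, while a flat spectrum does not. Where to place the threshold for calling a subspace "two-dimensional" is a judgment call that this sketch does not settle.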

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit separation could allow targeted modifications to the positional streams for improving long-context understanding without affecting semantics.
The 2D manifold might be encouraged in other positional encoding schemes to retain document-level information.
Applying the disentangled model to retrieval or long-context tasks could test whether preserved structure yields practical gains beyond probing.
This architecture might serve as a diagnostic tool for studying how positional information is processed separately from meaning.

Load-bearing premise

The three streams remain cleanly separated during training without the semantic-only MLM objective causing leakage or collapse in the positional streams.

What would settle it

Failure of the absolute positional subspace to collapse into a two-dimensional manifold, or absence of improvement on the Flash-Holmes benchmark under the disentangled training regime, would falsify the preservation claim.

Figures

Figures reproduced from arXiv: 2605.30022 by Benjamin Piwowarski, Camille Barboule, Pierre-Antoine Lequeu.

**Figure 2.** Categorization of all heads across layers of …

**Figure 3.** Softmax applied independently to the last layer’s attention weights for semantic (1st row), AP (2nd row) …

**Figure 4.** 2-dimensional PCA of the AP hidden states at each layer when encoding a long document. Each sentence …

**Figure 5.** DSTG-NeoBERT attention weights mechanism. Grayed-out squares correspond to discarded weights.

**Figure 6.** Cumulative explained variance of singular …

**Figure 7.** Results of the structural probes on DSTG-NeoBERT with MLM on semantic only (red) and DSTG …

**Figure 8.** Regression from NeoBERT variants to DSTG-NeoBERT subspaces.

**Figure 9.** The first three PCs of the AP embeddings of different models.

Original abstract

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval \cite{chen-etal-2025-hope}. Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP) and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways. (1) The isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) Attention heads specialize into structure and semantic-oriented groups, with RP exclusively supporting the latter; (3) Standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The three-stream disentangling setup is a clean experimental move, but the isolation claim rests on thin controls.

The paper splits a Transformer into three separate streams for semantics, absolute position, and relative position, then trains the masked language modeling loss only on the semantic stream. This produces the observations that the absolute positional stream collapses on its own into a low-frequency 2D manifold reflecting document structure, that attention heads split into structure-oriented and semantic-oriented groups, and that the disentangled model scores better on 49 of 65 Flash-Holmes probing tasks.

The explicit three-stream architecture is the real addition. Earlier work noted orthogonal subspaces; this version turns the separation into a controllable experimental lever and reports the spontaneous collapse and head specialization as direct consequences. That is useful for anyone trying to understand why current positional encodings lose macroscopic structure under standard training.

The main weakness is the lack of evidence that the streams stay isolated. Confining the loss does not automatically block cross-stream attention or gradient flow, so semantic information could still reach the positional parameters. Without ablations that test for leakage, the collapse and the probing gains cannot be confidently attributed to disentanglement rather than training artifacts. The abstract gives no numbers, error bars, or implementation specifics that would let a reader check this.
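
One simple form a leakage test could take is a gradient check: perturb a positional-stream parameter and see whether the MLM loss moves. The sketch below uses a toy scalar model, not the paper's architecture. The names `w_sem`, `w_ap` and the `leak` coefficient are hypothetical, and `leak` stands in for any path (shared residual, cross-stream attention) by which the AP stream could reach the prediction head.

```python
def loss(params, x, pos, target, leak):
    """Squared error of a toy MLM-style head.

    params: dict with scalar weights 'w_sem' (semantic stream) and
            'w_ap' (absolute positional stream).
    leak:   coefficient with which the AP stream is mixed into the head
            input; 0.0 models a head that reads the semantic stream only.
    """
    h_sem = params["w_sem"] * x
    h_ap = params["w_ap"] * pos
    head_input = h_sem + leak * h_ap
    return (head_input - target) ** 2


def finite_diff_grad(name, params, x, pos, target, leak, eps=1e-6):
    """Central finite-difference estimate of d loss / d params[name]."""
    up = dict(params)
    up[name] += eps
    down = dict(params)
    down[name] -= eps
    diff = loss(up, x, pos, target, leak) - loss(down, x, pos, target, leak)
    return diff / (2 * eps)


params = {"w_sem": 0.7, "w_ap": -1.3}
x, pos, target = 2.0, 5.0, 1.0

# With no leak, the loss cannot depend on the AP weight at all.
grad_ap_isolated = finite_diff_grad("w_ap", params, x, pos, target, leak=0.0)
# With even a small leak, the AP weight receives a gradient from the loss.
grad_ap_leaky = finite_diff_grad("w_ap", params, x, pos, target, leak=0.1)
# The semantic weight is trained by the loss in both cases.
grad_sem_isolated = finite_diff_grad("w_sem", params, x, pos, target, leak=0.0)
```

In a real model the same idea would be applied with automatic differentiation over every AP and RP parameter. A zero gradient from the MLM loss is necessary for isolation, but it does not by itself rule out semantic information entering the positional streams through the forward pass.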

The work is aimed at researchers studying positional encodings and mechanistic interpretability. A reader in that area would find the setup worth trying even if the current results need tighter verification. It deserves a serious referee because the idea is straightforward to implement and the claims are specific enough to test or falsify.

Referee Report

2 major / 1 minor

Summary. The manuscript modifies an encoder Transformer to maintain three explicitly disentangled streams (semantic, absolute positional (AP), and relative positional (RP)) and confines the MLM objective exclusively to the semantic stream. It reports that the isolated AP subspace spontaneously collapses to a low-frequency 2D manifold capturing document structure, that attention heads specialize (with RP supporting semantic processing), and that the disentangled model improves linguistic representation on 49 of 65 phenomena in the Flash-Holmes probing benchmark, while standard encodings (RoPE, entangled AP) lose macroscopic structure under MLM pressure.

Significance. If the separation is verifiably clean and the reported gains are robust, the work supplies a mechanistic account of how positional information is stored and processed in Transformers and offers an empirical route to preserving positional structure that could inform better long-context encodings.

major comments (2)

[Model modification and training objective section] The claim of explicit disentanglement rests on confining MLM loss to the semantic stream, yet the text provides no mechanism (zeroed cross-stream attention, gradient blocking, or orthogonality constraint) that would provably prevent semantic signals from reaching AP/RP parameters via residuals or shared components. This is load-bearing for attributing the 2D AP collapse, head specialization, and 49/65 probing gains to isolation rather than training artifacts.
[Results and probing benchmark section] The statement that the disentangled approach 'improves linguistic representation on 49 of the 65 linguistic phenomena' is presented without reported baseline scores per phenomenon, statistical significance, or an ablation that isolates the contribution of stream separation from other architectural changes.

minor comments (1)

[abstract] The abstract cites chen-etal-2025-hope but the reference list entry is not shown in the provided text; ensure all citations are complete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the isolation mechanisms and the reporting of results. Revisions will be made where the manuscript can be strengthened without altering its core claims.

Referee: [Model modification and training objective section] The claim of explicit disentanglement rests on confining MLM loss to the semantic stream, yet the text provides no mechanism (zeroed cross-stream attention, gradient blocking, or orthogonality constraint) that would provably prevent semantic signals from reaching AP/RP parameters via residuals or shared components. This is load-bearing for attributing the 2D AP collapse, head specialization, and 49/65 probing gains to isolation rather than training artifacts.

Authors: The architecture maintains three streams with fully separate parameter sets and independent attention computations; no cross-stream attention is performed, and residuals remain stream-specific. The final MLM prediction head receives input exclusively from the semantic stream, so the loss produces no gradient signal to AP or RP parameters. We agree the manuscript would benefit from an explicit forward-pass diagram and gradient-flow description to make this isolation unambiguous. We will add both in the revision.
Revision: partial

Referee: [Results and probing benchmark section] The statement that the disentangled approach 'improves linguistic representation on 49 of the 65 linguistic phenomena' is presented without reported baseline scores per phenomenon, statistical significance, or an ablation that isolates the contribution of stream separation from other architectural changes.

Authors: Per-phenomenon accuracies for all models appear in Appendix C. We will promote a compact table of the 65 scores to the main text, add McNemar tests for significance on the reported improvements, and include an explicit statement that the only architectural difference between the disentangled model and the entangled-AP baseline is the stream separation itself.
Revision: yes
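
The McNemar test mentioned in the second response compares two models on the same items and looks only at the items where they disagree. A minimal sketch of the exact (binomial) form follows; the function name and the discordant counts are hypothetical and do not come from the paper or its appendix.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items the baseline gets right and the new model gets wrong.
    c: items the baseline gets wrong and the new model gets right.
    Under the null hypothesis each discordant pair is equally likely to
    fall in either cell, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for one probing task (not from the paper).
p_clear = mcnemar_exact(b=5, c=25)       # new model wins most disagreements
p_balanced = mcnemar_exact(b=10, c=10)   # disagreements split evenly
```

Running one such test per phenomenon yields 65 p-values, so a multiple-comparison correction would likely be needed before reading individual results as significant.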

Circularity Check

0 steps flagged

No circularity: empirical observations from model modification, no derivations or reductions to inputs

The paper describes an architectural modification to create three disentangled streams and confines MLM loss to the semantic stream, then reports empirical observations such as AP subspace collapse and probing improvements. No equations, derivations, or fitted parameters are presented that could reduce predictions to inputs by construction. The work builds on prior evidence of orthogonal subspaces but does not rely on self-citations for load-bearing uniqueness theorems or ansatzes. All central claims are framed as experimental outcomes rather than tautological redefinitions, satisfying the criteria for a self-contained empirical study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work rests on the domain assumption that positional and semantic signals occupy nearly orthogonal subspaces and on the modeling choice that MLM can be confined to one stream without side effects.

axioms (1)

Domain assumption: Positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers.
Invoked in the opening paragraph to justify the disentangling modification.
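
The phrase "nearly orthogonal subspaces" can be made quantitative with principal angles. The sketch below computes the mean squared cosine of the principal angles between two subspaces, using Gram-Schmidt to build orthonormal bases. This is one common measure and is not necessarily the one used in the paper or in the prior work it builds on; all function names and the toy vectors are hypothetical.

```python
from math import sqrt


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            proj = dot(w, q)
            w = [wi - proj * qi for wi, qi in zip(w, q)]
        norm = sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def subspace_overlap(vectors_a, vectors_b):
    """Mean squared cosine of the principal angles between two subspaces.

    0.0 means the subspaces are orthogonal; 1.0 means the smaller one is
    contained in the larger one.
    """
    qa = orthonormalize(vectors_a)
    qb = orthonormalize(vectors_b)
    if not qa or not qb:
        raise ValueError("both subspaces must be non-trivial")
    # Squared Frobenius norm of Qa^T Qb equals the sum of squared cosines.
    total = sum(dot(a, b) ** 2 for a in qa for b in qb)
    return total / min(len(qa), len(qb))


# Toy 4-dimensional example (illustrative vectors only).
semantic = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
positional_orth = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
positional_near = [[0.1, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

overlap_orth = subspace_overlap(semantic, positional_orth)
overlap_near = subspace_overlap(semantic, positional_near)
overlap_same = subspace_overlap(semantic, semantic)
```

On real hidden states the spanning vectors would typically be the leading principal directions of each signal, and the overlap would be compared against a random-subspace baseline of the same dimensions.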

Reference graph

Works this paper leans on

38 extracted references · 28 canonical work pages · 9 internal anchors

[1] Lola Le Breton, Quentin Fournier, John X Morris, Mariam El Mezouar, and Sarath Chandar. 2024. NeoBERT: A Next-Generation BERT.
[2] Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, and Wei Liu. 2025. HoPE: A novel positional encoding without long-term decay for enhanced context awareness and extrapolation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23044-23056, Vi... https://doi.org/10.18653/v1/2025.acl-long.1123
[3] Ta-Chung Chi, Ting-Han Fan, Peter J Ramadge, and Alexander Rudnicky. 2022. Kerple: Kernelized relative positional embedding for length extrapolation. Advances in Neural Information Processing Systems, 35:8386-8399.
[4] Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, and Peter Ramadge. 2023. Dissecting transformer length extrapolation via the lens of receptive field analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13522-13537, Toronto, Canada... https://doi.org/10.18653/v1/2023.acl-long.756
[5] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. Preprint, arXiv:2003.10555. https://doi.org/10.48550/arXiv.2003.10555
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a... https://doi.org/10.18653/v1/N19-1423
[7] Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32(3):221.
[8] Olga Golovneva, Tianlu Wang, Jason Weston, and Sainbayar Sukhbaatar. 2024. Contextual Position Encoding: Learning to Count What's Important. Preprint, arXiv:2405.18719. https://doi.org/10.48550/arXiv.2405.18719
[9] Zihan Gu, Ruoyu Chen, Han Zhang, Hua Zhang, and Yue Hu. 2026. Deconstructing positional information: From attention logits to training biases. In The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=D0u0glT060
[10] Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. 2024. Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation. Preprint, arXiv:2401.16421. https://doi.org/10.48550/arXiv.2401.16421
[11] Guolin Ke, Di He, and Tie-Yan Liu. 2021. Rethinking Positional Encoding in Language Pre-training. Preprint, arXiv:2006.15595. https://doi.org/10.48550/arXiv.2006.15595
[12] Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. 2021. SHAPE: Shifted absolute position embedding for transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3309-3321, Online and Punta Cana, Dominican Republic. Association for Computatio... https://doi.org/10.18653/v1/2021.emnlp-main.266
[13] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. Preprint, arXiv:1909.11942. https://doi.org/10.48550/arXiv.1909.11942
[14] Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. 2024. Functional Interpolation for Relative Positions Improves Long Context Transformers. Preprint, arXiv:2310.04418. https://doi.org/10.48550/arXiv.2310.04418
[15] Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. 2020. Learning to Encode Position for Transformer with Continuous Dynamical Model. In Proceedings of the 37th International Conference on Machine Learning, pages 6327-6335. PMLR.
[16] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. Preprint, arXiv:1907.11692. https://doi.org/10.48550/arXiv.1907.11692
[17] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. Preprint, arXiv:1711.05101. https://doi.org/10.48550/arXiv.1711.05101
[18] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. Preprint, arXiv:1609.07843. https://arxiv.org/abs/1609.07843
[19] Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014-2037, Dubrovnik, Croatia. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.eacl-main.148
[20] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. The FineWeb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems, volume 37, pages 30811-30849. Curran Assoc... https://doi.org/10.52202/079017-0970
[21] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. YaRN: Efficient Context Window Extension of Large Language Models. Preprint, arXiv:2309.00071. https://doi.org/10.48550/arXiv.2309.00071
[22] Ofir Press, Noah A. Smith, and Mike Lewis. 2022. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. Preprint, arXiv:2108.12409. https://doi.org/10.48550/arXiv.2108.12409
[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.
[24] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392, Austin, Texas. Association for Computational Linguistics. https://doi.org/10.18653/v1/D16-1264
[25] Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410-420, Prague, Czech Republi... https://aclanthology.org/D07-1043/
[26] Noam Shazeer. 2020. GLU Variants Improve Transformer. Preprint, arXiv:2002.05202. https://doi.org/10.48550/arXiv.2002.05202
[27] Jiajun Song and Yiqiao Zhong. 2024. Uncovering hidden geometry in Transformers via disentangling position and context. Preprint, arXiv:2310.04861. https://doi.org/10.48550/arXiv.2310.04861
[28] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. RoFormer: Enhanced Transformer with Rotary Position Embedding. Preprint, arXiv:2104.09864. https://doi.org/10.48550/arXiv.2104.09864
[29] Felipe Urrutia, Jorge Salas, Alexander Kozachinskiy, Cristian Buc Calderon, Hector Pasten, and Cristobal Rojas. 2025. Decoupling positional and symbolic attention behavior in transformers. Preprint, arXiv:2511.11579. https://arxiv.org/abs/2511.11579
[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
[31] Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, and Iryna Gurevych. 2024. Holmes: A Benchmark to Assess the Linguistic Competence of Language Models. Transactions of the Association for Computational Linguistics, 12:1616-1647. https://doi.org/10.1162/tacl_a_00718
[32] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353-355, Brussels,... https://doi.org/10.18653/v1/W18-5446
[33] Yu-An Wang and Yun-Nung Chen. 2020. What do position embeddings learn? An empirical study of pre-trained language model positional encoding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6840-6849, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.555
[34] Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2024. Retrieval Head Mechanistically Explains Long-Context Factuality. Preprint, arXiv:2404.15574. https://doi.org/10.48550/arXiv.2404.15574
[35] Zijun Wu, Anup Anand Deshmukh, Yongkang Wu, Jimmy Lin, and Lili Mou. 2025. The emergence of chunking structures with hierarchical RNN. Computational Linguistics, 51(3):815-841. https://doi.org/10.1162/coli_a_00545
[36] Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
[37] Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and 1 others. 2024. Dape: Data-adaptive positional encoding for length extrapolation. Advances in Neural Information Processing Systems, 37:26659-26700.
[38] Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and Yu Li. 2025. DAPE v2: Process attention score as feature map for length extrapolation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Lin... https://doi.org/10.18653/v1/2025.acl-long.522

[1] [1]

Lola Le Breton, Quentin Fournier, John X Morris, Mariam El Mezouar, and Sarath Chandar. 2024. NeoBERT : A Next-Generation BERT

2024

[2] [2]

Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, and Wei Liu. 2025. https://doi.org/10.18653/v1/2025.acl-long.1123 H o PE : A novel positional encoding without long-term decay for enhanced context awareness and extrapolation . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23044--23056, Vi...

work page doi:10.18653/v1/2025.acl-long.1123 2025

[3] [3]

Ta-Chung Chi, Ting-Han Fan, Peter J Ramadge, and Alexander Rudnicky. 2022. Kerple: Kernelized relative positional embedding for length extrapolation. Advances in Neural Information Processing Systems, 35:8386--8399

2022

[4] [4]

Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, and Peter Ramadge. 2023. https://doi.org/10.18653/v1/2023.acl-long.756 Dissecting transformer length extrapolation via the lens of receptive field analysis . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13522--13537, Toronto, Canada...

work page doi:10.18653/v1/2023.acl-long.756 2023

[5] [5]

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. https://doi.org/10.48550/arXiv.2003.10555 ELECTRA : Pre-training Text Encoders as Discriminators Rather Than Generators . Preprint, arXiv:2003.10555

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2003.10555 2020

[6] [6]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423 BERT : Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long a...

work page doi:10.18653/v1/n19-1423 2019

[7] [7]

Rudolph Flesch. 1948. A new readability yardstick. Journal of applied psychology, 32(3):221

1948

[8] [8]

Olga Golovneva, Tianlu Wang, Jason Weston, and Sainbayar Sukhbaatar. 2024. https://doi.org/10.48550/arXiv.2405.18719 Contextual Position Encoding : Learning to Count What 's Important . Preprint, arXiv:2405.18719

work page doi:10.48550/arxiv.2405.18719 2024

[9] [9]

Zihan Gu, Ruoyu Chen, Han Zhang, Hua Zhang, and Yue Hu. 2026. https://openreview.net/forum?id=D0u0glT060 Deconstructing positional information: From attention logits to training biases . In The Fourteenth International Conference on Learning Representations

2026

[10] [10]

Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, and Di He. 2024. https://doi.org/10.48550/arXiv.2401.16421 Two Stones Hit One Bird : Bilevel Positional Encoding for Better Length Extrapolation . Preprint, arXiv:2401.16421

work page doi:10.48550/arxiv.2401.16421 2024

[11] [11]

Guolin Ke, Di He, and Tie-Yan Liu. 2021. https://doi.org/10.48550/arXiv.2006.15595 Rethinking Positional Encoding in Language Pre-training . Preprint, arXiv:2006.15595

work page doi:10.48550/arxiv.2006.15595 2021

[12] [12]

Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.266 SHAPE : S hifted absolute position embedding for transformers . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3309--3321, Online and Punta Cana, Dominican Republic. Association for Computatio...

work page doi:10.18653/v1/2021.emnlp-main.266 2021

[13] [13]

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. https://doi.org/10.48550/arXiv.1909.11942 ALBERT : A Lite BERT for Self-supervised Learning of Language Representations . Preprint, arXiv:1909.11942

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1909.11942 2020

[14] [14]

Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, and Srinadh Bhojanapalli. 2024. https://doi.org/10.48550/arXiv.2310.04418 Functional Interpolation for Relative Positions Improves Long Context Transformers . Preprint, arXiv:2310.04418

work page doi:10.48550/arxiv.2310.04418 2024

[15] [15]

Xuanqing Liu, Hsiang-Fu Yu, Inderjit Dhillon, and Cho-Jui Hsieh. 2020. Learning to Encode Position for Transformer with Continuous Dynamical Model . In Proceedings of the 37th International Conference on Machine Learning , pages 6327--6335. PMLR

2020

[16] [16]

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. https://doi.org/10.48550/arXiv.1907.11692 RoBERTa : A Robustly Optimized BERT Pretraining Approach . Preprint, arXiv:1907.11692

[17]

Ilya Loshchilov and Frank Hutter. 2019. https://doi.org/10.48550/arXiv.1711.05101 Decoupled Weight Decay Regularization. Preprint, arXiv:1711.05101

[18]

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. https://arxiv.org/abs/1609.07843 Pointer sentinel mixture models. Preprint, arXiv:1609.07843

[19]

Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://doi.org/10.18653/v1/2023.eacl-main.148 MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics

[20]

Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. 2024. https://doi.org/10.52202/079017-0970 The FineWeb datasets: Decanting the web for the finest text data at scale. In Advances in Neural Information Processing Systems, volume 37, pages 30811--30849. Curran Assoc...

[21]

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2023. https://doi.org/10.48550/arXiv.2309.00071 YaRN: Efficient Context Window Extension of Large Language Models. Preprint, arXiv:2309.00071

[22]

Ofir Press, Noah A. Smith, and Mike Lewis. 2022. https://doi.org/10.48550/arXiv.2108.12409 Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. Preprint, arXiv:2108.12409

[23]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1--67

[24]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. https://doi.org/10.18653/v1/D16-1264 SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383--2392, Austin, Texas. Association for Computational Linguistics

[25]

Andrew Rosenberg and Julia Hirschberg. 2007. https://aclanthology.org/D07-1043/ V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410--420, Prague, Czech Republi...

[26]

Noam Shazeer. 2020. https://doi.org/10.48550/arXiv.2002.05202 GLU Variants Improve Transformer. Preprint, arXiv:2002.05202

[27]

Jiajun Song and Yiqiao Zhong. 2024. https://doi.org/10.48550/arXiv.2310.04861 Uncovering hidden geometry in Transformers via disentangling position and context. Preprint, arXiv:2310.04861

[28]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2023. https://doi.org/10.48550/arXiv.2104.09864 RoFormer: Enhanced Transformer with Rotary Position Embedding. Preprint, arXiv:2104.09864

[29]

Felipe Urrutia, Jorge Salas, Alexander Kozachinskiy, Cristian Buc Calderon, Hector Pasten, and Cristobal Rojas. 2025. https://arxiv.org/abs/2511.11579 Decoupling positional and symbolic attention behavior in transformers. Preprint, arXiv:2511.11579

[30]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc

[31]

Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, and Iryna Gurevych. 2024. https://doi.org/10.1162/tacl_a_00718 Holmes: A Benchmark to Assess the Linguistic Competence of Language Models. Transactions of the Association for Computational Linguistics, 12:1616--1647

[32]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. https://doi.org/10.18653/v1/W18-5446 GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353--355, Brussels,...

[33]

Yu-An Wang and Yun-Nung Chen. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.555 What do position embeddings learn? an empirical study of pre-trained language model positional encoding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6840--6849, Online. Association for Computational Linguistics

[34]

Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. 2024. https://doi.org/10.48550/arXiv.2404.15574 Retrieval Head Mechanistically Explains Long-Context Factuality. Preprint, arXiv:2404.15574

[35]

Zijun Wu, Anup Anand Deshmukh, Yongkang Wu, Jimmy Lin, and Lili Mou. 2025. https://doi.org/10.1162/coli_a_00545 The emergence of chunking structures with hierarchical RNN. Computational Linguistics, 51(3):815--841

[36]

Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc

[37]

Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and 1 others. 2024. DAPE: Data-adaptive positional encoding for length extrapolation. Advances in Neural Information Processing Systems, 37:26659--26700

[38]

Chuanyang Zheng, Yihang Gao, Han Shi, Jing Xiong, Jiankai Sun, Jingyao Li, Minbin Huang, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, and Yu Li. 2025. https://doi.org/10.18653/v1/2025.acl-long.522 DAPE v2: Process attention score as feature map for length extrapolation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Lin...
